Student Work

Mining graph patterns in software development from code repositories


GitHub is a cloud-based hosting platform for version control built on Git and is used by over 100 million users globally. Two important features are notably missing from a repository page on GitHub. The first is a complexity metric indicating how difficult it would be for a new user to pick up and start working on the project. The second is an indication of whether a repository will remain active in the future. This information can sometimes be found in a project’s README or inferred from recent commits, but there is no single indicator. These two factors are very important when deciding whether to use the code in a GitHub repository or to import a library from GitHub into a project. This project aims to use machine learning models to estimate complexity metrics and determine a repository’s status based on data retrieved from the GitHub API, without having to download the repository to a local computer. This was done by generating a graph of each repository in which files are nodes, edges connect files that were committed together, and the weight of each edge is the number of times the two files were committed together. Encodings of the node features were created with a Graph Auto Encoder and used as model features. It was found that, with the models used, the graph encodings provided no benefit to the predictive models.
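The graph construction described in the abstract can be illustrated with a minimal sketch. This is not the project's actual code: the function name `build_cocommit_graph` and the input format (one list of file paths per commit, as might be assembled from GitHub API responses) are assumptions made for illustration, and the sample commits are invented.

```python
from collections import Counter
from itertools import combinations


def build_cocommit_graph(commits):
    """Build a weighted co-commit graph (hypothetical helper).

    commits: iterable of lists of file paths, one list per commit.
    Returns (nodes, edges), where edges maps a sorted pair of file
    paths to the number of commits that touched both files.
    """
    nodes = set()
    edge_weights = Counter()
    for files in commits:
        # Deduplicate and sort so each pair has one canonical key.
        unique_files = sorted(set(files))
        nodes.update(unique_files)
        # Every pair of files in the same commit shares an edge.
        for a, b in combinations(unique_files, 2):
            edge_weights[(a, b)] += 1
    return nodes, dict(edge_weights)


# Invented sample data: three commits touching three files.
commits = [
    ["README.md", "src/main.py"],
    ["src/main.py", "src/utils.py"],
    ["README.md", "src/main.py", "src/utils.py"],
]
nodes, edges = build_cocommit_graph(commits)
```

In this sample, `README.md` and `src/main.py` appear together in two commits, so their edge weight is 2. A commit touching many files adds an edge for every pair of them, so very large commits can dominate the graph, and a real pipeline might need to cap or filter them.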

  • This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
  • E-project-040124-152208
  • 120315
  • 2024
UN Sustainable Development Goals
Date created
  • 2024-04-01
Resource type
  • E-project-040124-152208