Syncing Data about Boost Versions and Libraries with GitHub
About
The data in our database generally originates from somewhere in the Boost GitHub ecosystem.
This page will explain to Django developers how data is synced from GitHub to our database.
- Most code is in `libraries/github.py` and `libraries/tasks.py`
Release data
- Releases are also called "Versions."
- The model that saves Release/Version data is `versions/models.py::Version`
- We retrieve all the non-beta and non-release-candidate tags from the main Boost repo
Boost releases some tags as formal GitHub "releases," and these show up on the Releases tab.
Not all tags are official GitHub Releases, however, and this impacts where we get metadata about the tag.
To retrieve releases and tags, run:
./manage.py import_releases
This will:
- Delete existing Versions and LibraryVersions
- Retrieve tags and releases from the Boost GitHub repo
- Create new Versions for each tag and release that is not a beta or rc release
- Create a new LibraryVersion for each Library but not for historical versions
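The tag filter described in the steps above can be sketched as follows. This is a minimal sketch, not the project's actual code: the function names are hypothetical, and it assumes that beta and release-candidate tags contain `beta` or `rc` somewhere in the tag name.

```python
import re

# Assumption: pre-release tags carry a "beta" or "rc" marker in their name.
_PRERELEASE = re.compile(r"beta|rc", re.IGNORECASE)


def is_final_release_tag(tag_name):
    """Return True when the tag name has no beta or rc marker."""
    return _PRERELEASE.search(tag_name) is None


def filter_release_tags(tag_names):
    """Keep only the tags that look like final releases, preserving order."""
    return [name for name in tag_names if is_final_release_tag(name)]


# Hypothetical sample tag names, used only for illustration.
tags = ["boost-1.81.0", "boost-1.81.0.beta1", "boost-1.80.0-rc1", "boost-1.80.0"]
print(filter_release_tags(tags))
```

A substring match is deliberately loose. If tag names could contain `rc` for other reasons, a stricter pattern anchored to the end of the tag would be safer.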
Library data
- Once a month, the task `libraries/tasks/update_libraries()` runs.
- It cycles through all Boost libraries and updates data
- It only handles the most recent version of Boost and does not handle older versions yet.
- There are methods to download issues and PRs, but they are not currently called.
Tasks or Questions
- A new GitHub API token needs to be generated through the CPPAlliance GitHub organization and added as a kube secret
- `self.skip_modules`: This exists in both `GitHubAPIClient` and `LibraryUpdater`, but it should probably only exist in `LibraryUpdater`, to keep `GitHubAPIClient` less tightly coupled to specific repos
- If we only want aggregate data for issues and PRs, do we need to save the data from them in models, or can we just save the aggregate data somewhere?
Glossary
To make the code more readable to the Boost team, who will ultimately maintain the project, we tried to replicate their terminology as much as possible.
- Library: Boost “Libraries” correspond to GitHub repositories
- `.gitmodules`: The file in the main Boost project repo that contains the information on all the repos that are considered Boost libraries
- module and submodule: Other words for library that correspond more specifically to GitHub data
How it Works
LibraryUpdater
This is not a code walkthrough, but is a general overview of the objects and data that this class retrieves.
- The Celery task `libraries/tasks.py/update_libraries` runs `LibraryUpdater.update_libraries()`
- This class uses the `GitHubAPIClient` class to call the GitHub API
- It retrieves the list of libraries to update from the `.gitmodules` file in the main Boost repo: https://github.com/boostorg/boost/blob/master/.gitmodules
- From that list, it excludes any libraries in `self.skip_modules`. The modules in that list are not imported into the database.
- For each remaining library:
  - It uses the information from the `.gitmodules` file to call the GitHub API for that specific library
  - It downloads the `meta/libraries.json` file for that library and parses that data
  - It uses the parsed data to add or update the Library record in our database for that GitHub repo
  - It adds the library to the most recent Version object to create a LibraryVersion record, if needed
  - The library categories are updated
  - The maintainers are updated and stub Users are added for them if needed.
  - The authors are updated and stub Users are added for them if needed (updated second because maintainers are more likely to have email addresses, so matching is easier).
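The overall flow above can be sketched with plain Python data structures. This is a simplified illustration under stated assumptions, not the real `LibraryUpdater`: the function names, the `fetch_metadata` and `save_library` callables, and the sample module names are all hypothetical stand-ins for the GitHub API calls and database writes.

```python
def select_modules(submodule_names, skip_modules):
    """Drop skipped modules while preserving the order from .gitmodules."""
    skipped = set(skip_modules)
    return [name for name in submodule_names if name not in skipped]


def update_libraries(submodule_names, skip_modules, fetch_metadata, save_library):
    """Run the per-library steps for every module that is not skipped."""
    updated = []
    for name in select_modules(submodule_names, skip_modules):
        # Stands in for downloading and parsing meta/libraries.json.
        metadata = fetch_metadata(name)
        # Stands in for the add-or-update step against the database.
        save_library(name, metadata)
        updated.append(name)
    return updated


# Hypothetical in-memory stand-ins, used only for illustration.
fake_remote = {"asio": {"name": "Asio"}, "system": {"name": "System"}}
database = {}

result = update_libraries(
    submodule_names=["asio", "inspect", "system"],
    skip_modules=["inspect"],
    fetch_metadata=lambda name: fake_remote[name],
    save_library=lambda name, metadata: database.update({name: metadata}),
)
print(result)
```

Passing the fetch and save steps in as callables keeps the loop testable without network access or a database.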
GithubAPIClient
- This class controls the requests to and responses from the GitHub API. It is mostly a wrapper around `GhApi` that allows us to set some default values to make calling the methods easier, and allows us to retrieve some data that is very specific to the Boost repos
- Requires the environment variable `GITHUB_TOKEN` to be set
- Contains methods to retrieve the `.gitmodules` file, the `libraries.json` file, general repo data, repo issues, repo PRs, and the git tree.
GithubDataParser
- Contains methods to parse the data we retrieve from GitHub into more useful formats
- Contains methods to parse the `.gitmodules` file and the `libraries.json` file, and to extract the author and maintainer names and email addresses, if present.
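Extracting a name and an optional email address from an author or maintainer entry might look like the sketch below. It is illustrative only: the helper name `parse_person` is hypothetical, and it assumes entries are shaped like `Name <email>` with the address sometimes obfuscated as `user -at- host`.

```python
import re

# Assumption: an entry is "Name" optionally followed by "<email>".
_NAME_EMAIL = re.compile(r"^\s*(?P<name>[^<]*?)\s*(?:<(?P<email>[^>]*)>)?\s*$")


def parse_person(raw):
    """Split one author or maintainer string into a name and an optional email."""
    match = _NAME_EMAIL.match(raw)
    if match is None:
        # Unexpected shape: keep the whole string as the name.
        return {"name": raw.strip(), "email": None}
    email = match.group("email")
    if email:
        # Assumption: obfuscated addresses use " -at- " in place of "@".
        email = email.strip().replace(" -at- ", "@")
    return {"name": match.group("name").strip(), "email": email or None}


# Hypothetical sample entries, used only for illustration.
print(parse_person("Jane Doe <jane -at- example.com>"))
print(parse_person("John Smith"))
```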
Attributes
| Attribute | Description | Default |
|---|---|---|
| owner | GitHub repo owner | boostorg |
| ref | GitHub branch or tag to use on that repo | heads/master |
| repo_slug | GitHub repo slug | default |

- `self.skip_modules`: This is the list of modules/libraries from `.gitmodules` that we do not download
GitHub Data
- Each Boost Library has a GitHub repo.
- Most of the time, one library has one repo. Other times, one GitHub repo is shared among multiple libraries (the “Algorithm” library is an example).
- The most important file for each Boost library is `meta/libraries.json`
.gitmodules
This is the most important file in the main Boost repository. It contains the GitHub information for all Libraries included in that tagged Boost version, and is what we use to identify which Libraries to download into our database.
- `submodule`: Corresponds to the `key` in `libraries.json`
  - Contains information for the top-level Library, but not other sub-libraries stored in the same repo
- `path`: the path to navigate to the Library repo from the main Boost repo
- `url`: the URL for the `.git` repo for the library, in relative terms (`../system.git`)
- `fetchRecurseSubmodules`: We don’t use this field
- `branch`: We don’t use this field
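The fields above can be read with Python's standard `configparser`, because `.gitmodules` uses an INI-like layout. The sketch below is one possible approach and may differ from the project's parser. The sample text and the helper name `parse_gitmodules` are hypothetical.

```python
import configparser
import re
import textwrap

# Hypothetical sample in the .gitmodules layout, used only for illustration.
SAMPLE_GITMODULES = textwrap.dedent('''\
    [submodule "system"]
        path = libs/system
        url = ../system.git
        fetchRecurseSubmodules = on-demand
        branch = .
    [submodule "asio"]
        path = libs/asio
        url = ../asio.git
        fetchRecurseSubmodules = on-demand
        branch = .
''')


def parse_gitmodules(text):
    """Return one dict per submodule with the fields this page says we use."""
    # Interpolation is disabled so a literal "%" in a value cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    modules = []
    for section in parser.sections():
        match = re.fullmatch(r'submodule "(.+)"', section)
        if match is None:
            continue  # Ignore sections that are not submodule entries.
        modules.append({
            "submodule": match.group(1),
            "path": parser.get(section, "path"),
            "url": parser.get(section, "url"),
        })
    return modules


modules = parse_gitmodules(SAMPLE_GITMODULES)
print(modules)
```

`fetchRecurseSubmodules` and `branch` are left out of the result because they are not used.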
libraries.json
This is the most important file in the GitHub repo for a library. It is where we retrieve all the metadata about the Library. It is the source of truth.
- `key`: The GitHub slug, and the slug we use for our Library object
  - When the repo hosts a single Library, the `key` corresponds to the `submodule` in the main Boost repo’s `.gitmodules` file. Example: `"key": "asio"`
  - When the repo hosts multiple libraries, the first `key` corresponds to the `submodule`. Example: `"key": "algorithm"`. The following keys in `libraries.json` are then prefixed with the original `key` before adding their own slug. Example: `"key": "algorithm/minimax"`
- `name`: What we save as the Library name
- `authors`: A list of names of original authors of the Library’s documentation.
  - Data is very unlikely to change
  - Data generally does not contain emails
  - Stub users are created for authors with fake email addresses, and users will be able to claim those accounts.
- `description`: What we save as the `Library` description
- `category`: A list of category names. We use this to attach Categories to the Libraries.
- `maintainers`: A list of names and emails of current maintainers of this Library
  - Data may change between versions
  - Data generally contains emails
  - Stub users are created for all maintainers. We use fake email addresses if an email address is not present
  - We try to be smart: if the same name shows up as an author and a maintainer, we won’t create two fake records. But it’s imperfect.
- `cxxstd`: C++ version in which this Library was added
Example with a single library:
Example with multiple libraries:
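Both shapes can be handled by one small helper, sketched below. It assumes that a single-library file is one JSON object and a multi-library file is a JSON array of such objects. The sample values and the helper name `normalize_libraries` are hypothetical.

```python
import json

# Hypothetical sample payloads, used only for illustration.
SINGLE = '{"key": "asio", "name": "Asio"}'
MULTIPLE = (
    '[{"key": "algorithm", "name": "Algorithm"},'
    ' {"key": "algorithm/minimax", "name": "Min-Max"}]'
)


def normalize_libraries(raw_json):
    """Return a list of library entries regardless of the file's shape."""
    data = json.loads(raw_json)
    # Assumption: a lone object means the repo hosts exactly one library.
    entries = data if isinstance(data, list) else [data]
    libraries = []
    for entry in entries:
        key = entry["key"]
        # The part before the first "/" is the submodule name.
        submodule, _, sub_slug = key.partition("/")
        libraries.append({
            "key": key,
            "submodule": submodule,
            "name": entry.get("name", ""),
            "is_sub_library": bool(sub_slug),
        })
    return libraries


print(normalize_libraries(SINGLE))
print(normalize_libraries(MULTIPLE))
```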
General Maintenance Notes
How to change the skipped libraries
- To add a new skipped submodule: add the name of the submodule to the list `self.skip_modules` and make a PR. This will not remove the library from the database, but it will stop refreshing data for that library.
- To remove a submodule that is currently being skipped: remove the name of the submodule from `self.skip_modules` and make a PR. The library will be added to the database the next time the update runs.
How to delete Libraries
- Via the Admin. The Library update process does not delete any records.
How to add new Categories
- They will be automatically added as part of the download process as soon as they are added to a library's `libraries.json` file.
How to remove authors or maintainers
- Via the Admin.
- But if they are not also removed from the `libraries.json` file for the affected library, then they will be added back the next time the job runs.