A portrait of VSCode's external contributors
We build a data analysis pipeline to explore large GitHub repos. The same method will work with data from any Steampipe plugin.
In Using Steampipe's GitHub plugin to connect with your open source community we mined the tables provided by that plugin to explore our own organization's repos. That article shows how to find stale issues, visualize release cadence, and identify external contributors.
As a follow-on to that article we decided to ask and answer a question we've often wondered about: Who are the external contributors to
microsoft/vscode, and what are their patterns of contributions? This is an enormous repo: 1500 contributors, 88,500 commits, and 121,000 issues. Steampipe can handle lots of data, but joins involving this many contributors, commits, and issues blew out the memory on a EC2 t2.medium, and then a t2.large. So we switched to another strategy.
A query pipeline
The solution was to cache the big tables: commits (88K) and issues (121K). Then we could then derive a series of tables based on those, plus other tables joined on the fly. It should take you about an hour to run the full pipeline. While it's running, you can
select * from vscode_log to check progress. Here are all the tables you'll end up with.
2021-10-18 23:07:03.061311 | vscode_org_members2021-10-18 23:07:17.702559 | vscode_commits2021-10-18 23:13:07.961552 | vscode_committers2021-10-18 23:13:08.11581 | vscode_committer_details2021-10-18 23:17:35.505082 | vscode_internal_committers2021-10-18 23:17:35.563092 | vscode_internal_commits2021-10-18 23:17:36.267878 | vscode_internal_commit_counts2021-10-18 23:17:36.541539 | vscode_external_committers2021-10-18 23:17:36.579981 | vscode_external_commits2021-10-18 23:17:36.757549 | vscode_external_commit_counts2021-10-18 23:17:36.766478 | vscode_issues2021-10-19 00:02:38.071931 | vscode_issue_filers2021-10-19 00:02:39.090606 | vscode_issue_filer_details2021-10-19 00:02:39.094259 | vscode_internal_issue_filers2021-10-19 00:02:39.220729 | vscode_internal_issues2021-10-19 00:02:39.701516 | vscode_internal_issue_counts2021-10-19 00:02:39.735744 | vscode_external_issue_filers2021-10-19 00:02:39.893828 | vscode_external_issues2021-10-19 00:02:40.835652 | vscode_external_issue_counts2021-10-19 00:02:41.169555 | vscode_external_contributors2021-10-19 00:02:41.349196 | vscode_external_commit_timelines
Each phase of the pipeline produces a table, which helps make things robust to failure. While developing this pipeline, for example, we triggered the hourly API rate limit several times, but were able to pick up where we left off. None of this is specific to Steampipe, it's just a good way to do analytics based on API-sourced data.
To run the pipeline for yourself, try the SQL script you can see in the Show Me the SQL appendix at the end.
Topline stats for committers
Of 1502 committers, we classifed 1348 as external (leaving 153 internal) based on these rules:
- Their logins don't appear in the members field of the
github_organizationwhose name is microsoft.
- Their company and email columns, from
github_user, don't match microsoft.
- Their logins match a handful of enumerated exceptions.
Topline stats for issue authors
Of 52,028 issue authors, we classifed 51,9123 (leaving 382 as internal) based on similar rules. It was impractical to fetch the company and email info for all 52,028, that would take 10 hours at 5K API calls/hr. So we used committer data as a proxy for issue data when excluding Microsoft-affiliated issue authors. That's less precise, but the primary filter -- membership in Microsoft's GitHub organization -- is pretty good.
There were 116,241 closed issues, and 5400 open issues.
The averge length of time between opening and closing issues was about the same for issues filed by internal authors (57 days) and external authors (54 days).
Histograms for the two categories confirm that while the absolute numbers differ, the pattern is again very similar.
Commits vs issues for external contributors
This chart compares the number of commits and issues for each external contributor. There are some intriguing outliers: check out DanTup over there on the far right of the x axis, with 1 commit and 466 issues!
Top 10 external committers and issue authors
These are the top 10 external committers.
These are the top 10 external issue authors.
Top 10 internal committers and issue authors
For comparison, these are the top 10 internal committers.
These are the top 10 internal issue authors.
Activity timespans for external committers
This chart shows the first and last commit dates for each external committer.
These are the external committers with the longest spans.
author_login | days----------------|---------yume-chan | 1517Huachao | 1385forivall | 1367Ikuyadeu | 1353akosyakov | 1320jeanp413 | 1303JoshuaKGoldberg | 1135usernamehw | 111871 | 1102
Internal and external issue tags
These are the top 10 tags for issues filed by internal authors.
tag | count--------------------|---------verified | 9204bug | 9167feature-request | 3342insiders-released | 2459debug | 1951*duplicate | 1782debt | 1594terminal | 1524testplan-item | 1158verification-needed | 1105
These are the top 10 tags for issues filed by external authors.
tag | count---------------------|---------needs more info | 18448*duplicate | 15250feature-request | 11530bug | 10144verified | 8533*caused-by-extension | 5493terminal | 4034*question | 3887debug | 3863upstream | 3336
Beyond the GitHub API
What you've seen here isn't specific to the VSCode repo, it'll work for any large GitHub repo. If you try the pipeline shown here -- perhaps modifying it to answer some of your own questions, or pointing it at another repo -- let us know how it goes.
We think this pipeline will interest people who care deeply about, and study, open source community dynamics. What if your game is something entirely different? Well, if it revolves around compliance with cloud security recommendations, we've got you covered. That's a sweet spot for Steampipe that we'll explore in future posts.
More broadly, the methods shown here can apply across the whole set of Steampipe plugins. If you find yourself reaching for APIs to answer questions in any of those domains, give Steampipe a try and again let us know what you find.
Show me the SQL
- Install Steampipe: https://steampipe.io/downloads
- Install the GitHub plugin:
steampipe plugin install github
- Configure the GitHub plugin with your personal access token: https://hub.steampipe.io/plugins/turbot/github#credentials
psqlfor Postgres 12: https://www.compose.com/articles/postgresql-tips-installing-the-postgresql-client/
- Save this script as
steampipe service start
psql -h localhost -p 9193 -d steampipe -U steampipe < vscode.sql
You can also find this script in the steampipe-samples repo, along with a tool that can generate versions of the script to analyze other repos in the same way.
drop table if exists vscode_log;create table vscode_log(time timestamp, event text);drop table if exists vscode_org_members;insert into vscode_log(time, event) values (now(), 'vscode_org_members');create table vscode_org_members as (selectg.name,g.login,jsonb_array_elements_text(g.member_logins) as member_loginfromgithub_organization gwhereg.login = 'microsoft');drop table if exists vscode_commits;insert into vscode_log(time, event) values (now(), 'vscode_commits');create table vscode_commits as (selectg.repository_full_name,g.author_login,g.author_date,g.commit->'author'->>'email' as author_email,g.committer_login,g.committer_datefromgithub_commit gwhereg.repository_full_name = 'microsoft/vscode');drop table if exists vscode_committers;insert into vscode_log(time, event) values (now(), 'vscode_committers');create table vscode_committers as (with unordered as (select distinctc.repository_full_name,c.author_loginfromvscode_commits c)select*fromunorderedorder bylower(author_login));drop table if exists vscode_committer_details;insert into vscode_log(time, event) values (now(), 'vscode_committer_details');create table vscode_committer_details as (selectg.login,g.name,g.company,g.email,g.twitter_usernamefromgithub_user gjoinvscode_committers conc.author_login = g.login);drop table if exists vscode_internal_committers;insert into vscode_log(time, event) values (now(), 'vscode_internal_committers');create table vscode_internal_committers as (with by_membership as (select*fromvscode_committers cjoinvscode_org_members oonc.author_login = o.member_loginorder byc.author_login),by_vscode_committer_details as (select*fromvscode_committer_details cdwherecd.company ~* 'microsoft' or cd.email ~* 'microsoft'order bycd.login),combined as (selectm.author_login as m_login,cd.login as c_loginfromby_membership mfull joinby_vscode_committer_details cdonm.author_login = cd.login),merged as (selectcasewhen m_login is null then c_loginelse m_loginend as author_loginfromcombined)select*frommergedorder bylower(author_login));drop table if exists vscode_internal_commits;insert into vscode_log(time, event) values (now(), 'vscode_internal_commits');create table vscode_internal_commits as (select*fromvscode_commits cjoinvscode_internal_committers iusing(author_login));drop table if exists vscode_internal_commit_counts;insert into vscode_log(time, event) values (now(), 'vscode_internal_commit_counts');create table vscode_internal_commit_counts as (selecti.repository_full_name,i.author_login,count(*)fromvscode_internal_commits igroup byi.repository_full_name,i.author_loginorder bycount desc);drop table if exists vscode_external_committers;insert into vscode_log(time, event) values (now(), 'vscode_external_committers');create table vscode_external_committers as (select*fromvscode_committers cwhere not exists (select*fromvscode_internal_committers iwherec.author_login = i.author_loginor c.author_login = any ( array ['octref','eamodio'] ))order byc.author_login);drop table if exists vscode_external_commits;insert into vscode_log(time, event) values (now(), 'vscode_external_commits');create table vscode_external_commits as (select*fromvscode_commits cjoinvscode_external_committers iusing(repository_full_name, author_login));drop table if exists vscode_external_commit_counts;insert into vscode_log(time, event) values (now(), 'vscode_external_commit_counts');create table vscode_external_commit_counts as (selecte.repository_full_name,e.author_login,count(*)fromvscode_external_commits egroup bye.repository_full_name,e.author_loginorder bycount desc);drop table if exists vscode_issues;insert into vscode_log(time, event) values (now(), 'vscode_issues');create table vscode_issues as (selectrepository_full_name,author_login,issue_number,title,created_at,closed_at,state,comments,tagsfromgithub_issuewhererepository_full_name = 'microsoft/vscode');drop table if exists vscode_issue_filers;insert into vscode_log(time, event) values (now(), 'vscode_issue_filers');create table vscode_issue_filers as (with unordered as (select distincti.repository_full_name,i.author_loginfromvscode_issues i)select*fromunorderedorder bylower(author_login));-- insert into vscode_log(time, event) values (now(), 'vscode_issue_filer_details');-- create table vscode_issue_filer_details as (---- impractical for vscode's 52K issue authors at 5K API calls/hr!'----);drop table if exists vscode_internal_issue_filers;insert into vscode_log(time, event) values (now(), 'vscode_internal_issue_filers');create table vscode_internal_issue_filers as (select*fromvscode_issue_filers ijoinvscode_org_members ooni.author_login = o.member_loginorder byi.author_login);drop table if exists vscode_internal_issues;insert into vscode_log(time, event) values (now(), 'vscode_internal_issues');create table vscode_internal_issues as (selecti.repository_full_name,lower(i.author_login) as author_login,i.issue_number,i.created_at,i.closed_at,i.comments,i.state,i.title,i.tagsfromvscode_issues ijoinvscode_internal_issue_filers ifoni.author_login = if.author_loginand i.repository_full_name = if.repository_full_nameorder by author_login);drop table if exists vscode_internal_issue_counts;insert into vscode_log(time, event) values (now(), 'vscode_internal_issue_counts');create table vscode_internal_issue_counts as (selecti.repository_full_name,i.author_login,count(*)fromvscode_internal_issues igroup byi.repository_full_name,i.author_loginorder bycount desc);drop table if exists vscode_external_issue_filers;insert into vscode_log(time, event) values (now(), 'vscode_external_issue_filers');create table vscode_external_issue_filers as (with unfiltered as (select*fromvscode_issue_filers i-- use vscode_internal_committers as a proxy for vscode_internal_issue_filers, which-- would require 52K github_user calls (at 5K/hr)where not exists (select*fromvscode_internal_committers cwherec.author_login = i.author_login)order byi.author_login)select*fromunfiltered uwherenot u.author_login = any ( array ['ghost', 'octref', 'vscodeerrors', 'eamodio'] ));drop table if exists vscode_external_issues;insert into vscode_log(time, event) values (now(), 'vscode_external_issues');create table vscode_external_issues as (select*fromvscode_issues ijoinvscode_external_issue_filers eusing(repository_full_name, author_login));drop table if exists vscode_external_issue_counts;insert into vscode_log(time, event) values (now(), 'vscode_external_issue_counts');create table vscode_external_issue_counts as (selecte.repository_full_name,e.author_login,count(*)fromvscode_external_issues egroup bye.repository_full_name,e.author_loginorder bycount desc);drop table if exists vscode_external_contributors;insert into vscode_log(time, event) values (now(), 'vscode_external_contributors');create table vscode_external_contributors as (selectc.repository_full_name,c.author_login,c.count as vscode_commits,'https://github.com/microsoft/vscode/commits?author=' || c.author_login as commits_url,i.count as vscode_issues,'https://github.com/microsoft/vscode/issues?q=author:' || c.author_login as issues_url,cd.name,cd.company,cd.twitter_usernamefromvscode_external_commit_counts cfull joinvscode_external_issue_counts iusing(repository_full_name, author_login)joinvscode_committer_details cdonc.author_login = cd.loginorder bylower(c.author_login));drop table if exists vscode_external_commit_timelines;insert into vscode_log(time, event) values (now(), 'vscode_external_commit_timelines');create table vscode_external_commit_timelines as (with data as (selecte.repository_full_name,e.author_login,min(c.author_date) as first,max(c.author_date) as lastfromvscode_external_contributors ejoinvscode_commits cusing (repository_full_name, author_login)group bye.repository_full_name, e.author_login)selectrepository_full_name,author_login,to_char(first, 'YYYY-MM-DD') as first,to_char(last, 'YYYY-MM-DD') as lastfromdata dwhered.first != d.lastorder byfirst, last);