How To

A portrait of VSCode's external contributors

We build a data analysis pipeline to explore large GitHub repos. The same method will work with data from any Steampipe plugin.

Jon Udell
10 min. read - Oct 26, 2021
We build a data analysis pipeline to explore large GitHub repos. The same method will work with data from any Steampipe plugin.

In Using Steampipe's GitHub plugin to connect with your open source community we mined the tables provided by that plugin to explore our own organization's repos. That article shows how to find stale issues, visualize release cadence, and identify external contributors.

As a follow-on to that article we decided to ask and answer a question we've often wondered about: Who are the external contributors to microsoft/vscode, and what are their patterns of contributions? This is an enormous repo: 1500 contributors, 88,500 commits, and 121,000 issues. Steampipe can handle lots of data, but joins involving this many contributors, commits, and issues blew out the memory on a EC2 t2.medium, and then a t2.large. So we switched to another strategy.

A query pipeline

The solution was to cache the big tables: commits (88K) and issues (121K). Then we could then derive a series of tables based on those, plus other tables joined on the fly. It should take you about an hour to run the full pipeline. While it's running, you can select * from vscode_log to check progress. Here are all the tables you'll end up with.

2021-10-18 23:07:03.061311 | vscode_org_members
2021-10-18 23:07:17.702559 | vscode_commits
2021-10-18 23:13:07.961552 | vscode_committers
2021-10-18 23:13:08.11581 | vscode_committer_details
2021-10-18 23:17:35.505082 | vscode_internal_committers
2021-10-18 23:17:35.563092 | vscode_internal_commits
2021-10-18 23:17:36.267878 | vscode_internal_commit_counts
2021-10-18 23:17:36.541539 | vscode_external_committers
2021-10-18 23:17:36.579981 | vscode_external_commits
2021-10-18 23:17:36.757549 | vscode_external_commit_counts
2021-10-18 23:17:36.766478 | vscode_issues
2021-10-19 00:02:38.071931 | vscode_issue_filers
2021-10-19 00:02:39.090606 | vscode_issue_filer_details
2021-10-19 00:02:39.094259 | vscode_internal_issue_filers
2021-10-19 00:02:39.220729 | vscode_internal_issues
2021-10-19 00:02:39.701516 | vscode_internal_issue_counts
2021-10-19 00:02:39.735744 | vscode_external_issue_filers
2021-10-19 00:02:39.893828 | vscode_external_issues
2021-10-19 00:02:40.835652 | vscode_external_issue_counts
2021-10-19 00:02:41.169555 | vscode_external_contributors
2021-10-19 00:02:41.349196 | vscode_external_commit_timelines

Each phase of the pipeline produces a table, which helps make things robust to failure. While developing this pipeline, for example, we triggered the hourly API rate limit several times, but were able to pick up where we left off. None of this is specific to Steampipe, it's just a good way to do analytics based on API-sourced data.

To run the pipeline for yourself, try the SQL script you can see in the Show Me the SQL appendix at the end.

Topline stats for committers

Of 1502 committers, we classifed 1348 as external (leaving 153 internal) based on these rules:

  • Their logins don't appear in the members field of the github_organization whose name is microsoft.
  • Their company and email columns, from github_user, don't match microsoft.
  • Their logins match a handful of enumerated exceptions.

Topline stats for issue authors

Of 52,028 issue authors, we classifed 51,9123 (leaving 382 as internal) based on similar rules. It was impractical to fetch the company and email info for all 52,028, that would take 10 hours at 5K API calls/hr. So we used committer data as a proxy for issue data when excluding Microsoft-affiliated issue authors. That's less precise, but the primary filter -- membership in Microsoft's GitHub organization -- is pretty good.

There were 116,241 closed issues, and 5400 open issues.

The averge length of time between opening and closing issues was about the same for issues filed by internal authors (57 days) and external authors (54 days).

Histograms for the two categories confirm that while the absolute numbers differ, the pattern is again very similar.

Commits vs issues for external contributors

This chart compares the number of commits and issues for each external contributor. There are some intriguing outliers: check out DanTup over there on the far right of the x axis, with 1 commit and 466 issues!

Top 10 external committers and issue authors

These are the top 10 external committers.

author_logincounttwitter
michelkaporin311michel_the_man
jeanp413190
Lixire111
bgashler194
usernamehw87
cleidigh86
gjsjohnmurray68
Krzysztof-Cieslak49
Kingwl45WenluWang
dependabot[bot]41

These are the top 10 external issue authors.

author_logincounttwitter
DanTup466DanTup
usernamehw393
AccessibilityTestingTeam-TCS359
RMacfarlane312
gjsjohnmurray173
vsccarl163
smlombardi145
mousetraps130mousetraps
v-pavanp130
fabiospampinato129fabiospampinato

Top 10 internal committers and issue authors

For comparison, these are the top 10 internal committers.

author_logincounttwitter
bpasero10364BenjaminPasero
jrieken8834johannesrieken
joaomoreno8142
mjbvz6398mattbierner
alexdima6174
sandy0815911
isidorn5865
Tyriar5274Tyriar
aeschli3848
rebornix3573

These are the top 10 internal issue authors.

author_logincounttwitter
bpasero3926BenjaminPasero
tyriar2582
jrieken2113johannesrieken
roblourens1729
joaomoreno1718
sandy0811187
alexdima1043
weinand1036
dbaeumer846
rebornix722

Activity timespans for external committers

This chart shows the first and last commit dates for each external committer.

These are the external committers with the longest spans.

author_login | days
----------------|---------
yume-chan | 1517
Huachao | 1385
forivall | 1367
Ikuyadeu | 1353
akosyakov | 1320
jeanp413 | 1303
JoshuaKGoldberg | 1135
usernamehw | 1118
71 | 1102

Internal and external issue tags

These are the top 10 tags for issues filed by internal authors.

tag | count
--------------------|---------
verified | 9204
bug | 9167
feature-request | 3342
insiders-released | 2459
debug | 1951
*duplicate | 1782
debt | 1594
terminal | 1524
testplan-item | 1158
verification-needed | 1105

These are the top 10 tags for issues filed by external authors.

tag | count
---------------------|---------
needs more info | 18448
*duplicate | 15250
feature-request | 11530
bug | 10144
verified | 8533
*caused-by-extension | 5493
terminal | 4034
*question | 3887
debug | 3863
upstream | 3336

Beyond the GitHub API

What you've seen here isn't specific to the VSCode repo, it'll work for any large GitHub repo. If you try the pipeline shown here -- perhaps modifying it to answer some of your own questions, or pointing it at another repo -- let us know how it goes.

We think this pipeline will interest people who care deeply about, and study, open source community dynamics. What if your game is something entirely different? Well, if it revolves around compliance with cloud security recommendations, we've got you covered. That's a sweet spot for Steampipe that we'll explore in future posts.

More broadly, the methods shown here can apply across the whole set of Steampipe plugins. If you find yourself reaching for APIs to answer questions in any of those domains, give Steampipe a try and again let us know what you find.

Show me the SQL

Instructions:

  1. Install Steampipe: https://steampipe.io/downloads
  2. Install the GitHub plugin: steampipe plugin install github
  3. Configure the GitHub plugin with your personal access token: https://hub.steampipe.io/plugins/turbot/github#credentials
  4. Install psql for Postgres 12: https://www.compose.com/articles/postgresql-tips-installing-the-postgresql-client/
  5. Save this script as vscode.sql
  6. steampipe service start
  7. psql -h localhost -p 9193 -d steampipe -U steampipe < vscode.sql

You can also find this script in the steampipe-samples repo, along with a tool that can generate versions of the script to analyze other repos in the same way.

drop table if exists vscode_log;
create table vscode_log(time timestamp, event text);
drop table if exists vscode_org_members;
insert into vscode_log(time, event) values (now(), 'vscode_org_members');
create table vscode_org_members as (
select
g.name,
g.login,
jsonb_array_elements_text(g.member_logins) as member_login
from
github_organization g
where
g.login = 'microsoft'
);
drop table if exists vscode_commits;
insert into vscode_log(time, event) values (now(), 'vscode_commits');
create table vscode_commits as (
select
g.repository_full_name,
g.author_login,
g.author_date,
g.commit->'author'->>'email' as author_email,
g.committer_login,
g.committer_date
from
github_commit g
where
g.repository_full_name = 'microsoft/vscode'
);
drop table if exists vscode_committers;
insert into vscode_log(time, event) values (now(), 'vscode_committers');
create table vscode_committers as (
with unordered as (
select distinct
c.repository_full_name,
c.author_login
from
vscode_commits c
)
select
*
from
unordered
order by
lower(author_login)
);
drop table if exists vscode_committer_details;
insert into vscode_log(time, event) values (now(), 'vscode_committer_details');
create table vscode_committer_details as (
select
g.login,
g.name,
g.company,
g.email,
g.twitter_username
from
github_user g
join
vscode_committers c
on
c.author_login = g.login
);
drop table if exists vscode_internal_committers;
insert into vscode_log(time, event) values (now(), 'vscode_internal_committers');
create table vscode_internal_committers as (
with by_membership as (
select
*
from
vscode_committers c
join
vscode_org_members o
on
c.author_login = o.member_login
order by
c.author_login
),
by_vscode_committer_details as (
select
*
from
vscode_committer_details cd
where
cd.company ~* 'microsoft' or cd.email ~* 'microsoft'
order by
cd.login
),
combined as (
select
m.author_login as m_login,
cd.login as c_login
from
by_membership m
full join
by_vscode_committer_details cd
on
m.author_login = cd.login
),
merged as (
select
case
when m_login is null then c_login
else m_login
end as author_login
from
combined
)
select
*
from
merged
order by
lower(author_login)
);
drop table if exists vscode_internal_commits;
insert into vscode_log(time, event) values (now(), 'vscode_internal_commits');
create table vscode_internal_commits as (
select
*
from
vscode_commits c
join
vscode_internal_committers i
using
(author_login)
);
drop table if exists vscode_internal_commit_counts;
insert into vscode_log(time, event) values (now(), 'vscode_internal_commit_counts');
create table vscode_internal_commit_counts as (
select
i.repository_full_name,
i.author_login,
count(*)
from
vscode_internal_commits i
group by
i.repository_full_name,
i.author_login
order by
count desc
);
drop table if exists vscode_external_committers;
insert into vscode_log(time, event) values (now(), 'vscode_external_committers');
create table vscode_external_committers as (
select
*
from
vscode_committers c
where not exists (
select
*
from
vscode_internal_committers i
where
c.author_login = i.author_login
or c.author_login = any ( array ['octref','eamodio'] )
)
order by
c.author_login
);
drop table if exists vscode_external_commits;
insert into vscode_log(time, event) values (now(), 'vscode_external_commits');
create table vscode_external_commits as (
select
*
from
vscode_commits c
join
vscode_external_committers i
using
(repository_full_name, author_login)
);
drop table if exists vscode_external_commit_counts;
insert into vscode_log(time, event) values (now(), 'vscode_external_commit_counts');
create table vscode_external_commit_counts as (
select
e.repository_full_name,
e.author_login,
count(*)
from
vscode_external_commits e
group by
e.repository_full_name,
e.author_login
order by
count desc
);
drop table if exists vscode_issues;
insert into vscode_log(time, event) values (now(), 'vscode_issues');
create table vscode_issues as (
select
repository_full_name,
author_login,
issue_number,
title,
created_at,
closed_at,
state,
comments,
tags
from
github_issue
where
repository_full_name = 'microsoft/vscode'
);
drop table if exists vscode_issue_filers;
insert into vscode_log(time, event) values (now(), 'vscode_issue_filers');
create table vscode_issue_filers as (
with unordered as (
select distinct
i.repository_full_name,
i.author_login
from
vscode_issues i
)
select
*
from
unordered
order by
lower(author_login)
);
-- insert into vscode_log(time, event) values (now(), 'vscode_issue_filer_details');
-- create table vscode_issue_filer_details as (
--
-- impractical for vscode's 52K issue authors at 5K API calls/hr!'
--
--);
drop table if exists vscode_internal_issue_filers;
insert into vscode_log(time, event) values (now(), 'vscode_internal_issue_filers');
create table vscode_internal_issue_filers as (
select
*
from
vscode_issue_filers i
join
vscode_org_members o
on
i.author_login = o.member_login
order by
i.author_login
);
drop table if exists vscode_internal_issues;
insert into vscode_log(time, event) values (now(), 'vscode_internal_issues');
create table vscode_internal_issues as (
select
i.repository_full_name,
lower(i.author_login) as author_login,
i.issue_number,
i.created_at,
i.closed_at,
i.comments,
i.state,
i.title,
i.tags
from
vscode_issues i
join
vscode_internal_issue_filers if
on
i.author_login = if.author_login
and i.repository_full_name = if.repository_full_name
order by author_login
);
drop table if exists vscode_internal_issue_counts;
insert into vscode_log(time, event) values (now(), 'vscode_internal_issue_counts');
create table vscode_internal_issue_counts as (
select
i.repository_full_name,
i.author_login,
count(*)
from
vscode_internal_issues i
group by
i.repository_full_name,
i.author_login
order by
count desc
);
drop table if exists vscode_external_issue_filers;
insert into vscode_log(time, event) values (now(), 'vscode_external_issue_filers');
create table vscode_external_issue_filers as (
with unfiltered as (
select
*
from
vscode_issue_filers i
-- use vscode_internal_committers as a proxy for vscode_internal_issue_filers, which
-- would require 52K github_user calls (at 5K/hr)
where not exists (
select
*
from
vscode_internal_committers c
where
c.author_login = i.author_login
)
order by
i.author_login
)
select
*
from
unfiltered u
where
not u.author_login = any ( array ['ghost', 'octref', 'vscodeerrors', 'eamodio'] )
);
drop table if exists vscode_external_issues;
insert into vscode_log(time, event) values (now(), 'vscode_external_issues');
create table vscode_external_issues as (
select
*
from
vscode_issues i
join
vscode_external_issue_filers e
using
(repository_full_name, author_login)
);
drop table if exists vscode_external_issue_counts;
insert into vscode_log(time, event) values (now(), 'vscode_external_issue_counts');
create table vscode_external_issue_counts as (
select
e.repository_full_name,
e.author_login,
count(*)
from
vscode_external_issues e
group by
e.repository_full_name,
e.author_login
order by
count desc
);
drop table if exists vscode_external_contributors;
insert into vscode_log(time, event) values (now(), 'vscode_external_contributors');
create table vscode_external_contributors as (
select
c.repository_full_name,
c.author_login,
c.count as vscode_commits,
'https://github.com/microsoft/vscode/commits?author=' || c.author_login as commits_url,
i.count as vscode_issues,
'https://github.com/microsoft/vscode/issues?q=author:' || c.author_login as issues_url,
cd.name,
cd.company,
cd.twitter_username
from
vscode_external_commit_counts c
full join
vscode_external_issue_counts i
using
(repository_full_name, author_login)
join
vscode_committer_details cd
on
c.author_login = cd.login
order by
lower(c.author_login)
);
drop table if exists vscode_external_commit_timelines;
insert into vscode_log(time, event) values (now(), 'vscode_external_commit_timelines');
create table vscode_external_commit_timelines as (
with data as (
select
e.repository_full_name,
e.author_login,
min(c.author_date) as first,
max(c.author_date) as last
from
vscode_external_contributors e
join
vscode_commits c
using (repository_full_name, author_login)
group by
e.repository_full_name, e.author_login
)
select
repository_full_name,
author_login,
to_char(first, 'YYYY-MM-DD') as first,
to_char(last, 'YYYY-MM-DD') as last
from
data d
where
d.first != d.last
order by
first, last
);