jeffreysmith•2h ago
We ran three experiments since v1 that changed the model substantially.
We tried to detect suspended GitHub accounts from behavioral signals (merge rate, network centrality, TF-IDF on PR titles, LLM classification with ~31K Gemini calls). Best individual AUC was 0.619 on a 1.9% base rate. The merged-PR population is too homogeneous. Accounts that pass code review look like everyone else. The interesting finding: the suspension rate among contributors with merged PRs is under 2%. The review process is a better filter than the discourse around AI slop suggests.
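To make the low-base-rate point concrete, here is a minimal sketch of why AUC is the metric that matters here. Everything in it is synthetic and illustrative (the shapes, the seed, and the stand-in signal are my assumptions, not the experiment's data): with ~2% positives, a trivial "never suspended" model is ~98% accurate, so accuracy is useless and a weakly informative signal lands right around the AUC range reported above.

```python
# Illustrative only: synthetic data standing in for the behavioral
# signals described above (merge rate, centrality, etc.).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.019).astype(int)   # ~1.9% positive base rate

# A weakly informative score: positives shifted up by half a std dev.
score = rng.normal(0.0, 1.0, n) + 0.5 * y

# Accuracy of predicting "never suspended" is ~0.98 regardless of skill.
trivial_accuracy = 1.0 - y.mean()

# AUC is threshold-free, so it still separates weak signal from none;
# this lands around the low-0.6s, similar to the numbers above.
auc = roc_auc_score(y, score)
```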
That led us to question the scoring model. The graph score (bipartite construction, personalized ranking, language normalization, the whole pipeline from v1) actively hurts predictions for the contributors who actually need scoring: unknown people with a handful of merged PRs. Merge rate alone outperforms merge rate plus graph at every tier we tested. The new default model is merged / (merged + closed). We also pulled account age out of the score into a separate advisory after DeLong tests showed it adds nothing once you condition on merge rate.
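The whole default model fits in a few lines. This is a sketch of the merged / (merged + closed) formula as stated above, not the actual good-egg API; the function name and the decision to return None for zero-history accounts are my additions.

```python
from typing import Optional

def merge_rate(merged: int, closed: int) -> Optional[float]:
    """Fraction of a contributor's finished PRs that were merged.

    merged / (merged + closed), per the post. Returns None when there
    is no PR history to score rather than inventing a prior for
    brand-new accounts (hypothetical handling, not good-egg's).
    """
    total = merged + closed
    if total == 0:
        return None
    return merged / total
```

Note what's deliberately absent: no graph term and no account-age term, since the DeLong tests showed neither adds anything once you condition on merge rate.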
The post has the full data, including the tables.
Next we're working on content scoring (does this PR fit this repo's conventions?) and cold-start tooling (helping new contributors understand project expectations before they submit). Contributor reputation is one input to review triage. The PR itself carries more signal.
Repo: https://github.com/2ndSetAI/good-egg
pip install good-egg
Or just run it via uvx.