great_psy•24m ago
How do you measure quality at scale? Is there another model that determines whether the code adheres to the codebase's standards?
swyx•15m ago
see "Beyond Unit Tests" and "Novel Grading Methods" in TFA.
i think something like ~60% of the rubrics are LLM-as-judge and the rest are as described there. every rubric is validated by a maintainer. 3000 rubrics in total
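For readers unfamiliar with rubric-style grading, here is a minimal sketch of how per-rubric verdicts might roll up into a task score. This is not the benchmark's actual grader: the `RubricResult` type, the `kind` labels, and the plain pass-fraction aggregation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    rubric_id: str   # hypothetical identifier
    kind: str        # "llm_judge" or "other" (hypothetical labels)
    passed: bool     # verdict for this rubric


def task_score(results):
    """Fraction of rubrics passed for one task (assumed aggregation rule)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def judge_share(results):
    """Fraction of rubrics that are graded by an LLM judge."""
    if not results:
        return 0.0
    return sum(r.kind == "llm_judge" for r in results) / len(results)


# Made-up verdicts for a single task: 3 of 5 rubrics are LLM-judged.
results = [
    RubricResult("r1", "llm_judge", True),
    RubricResult("r2", "llm_judge", False),
    RubricResult("r3", "llm_judge", True),
    RubricResult("r4", "other", True),
    RubricResult("r5", "other", False),
]

print(task_score(results))   # 0.6
print(judge_share(results))  # 0.6
```

A real grader could weight rubrics or require all of them to pass; the unweighted fraction here is only the simplest choice.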
singpolyma3•19m ago
Since no one knows or can agree on what "code quality" is and we can't measure it for human output, I'm dubious about measuring it for LLMs
swyx•27m ago
some headlines
- 3000 rubrics on code quality. First benchmark to measure: "would this code actually get merged?"
- 20+ expert open-source maintainers created tasks on their own repos to capture their opinions & taste.
- 40+ hours of real human work per task. In total, 1000+ hours of real-life software maintainer work captured in the dataset
- results in an 81% lower false positive rate than SWE-Bench Pro
- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)
Opus 4.8 scores 13% on FrontierCode Diamond.
one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323
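To make the "81% lower false positive rate" headline concrete: a relative reduction of 81% means the new rate is 19% of the baseline rate. The sketch below shows only that arithmetic. The baseline value of 0.20 is a made-up placeholder, not a figure from the post, and `reduced_rate` is a hypothetical helper.

```python
def reduced_rate(baseline_rate, relative_reduction):
    """Rate that remains after applying a relative reduction to a baseline."""
    return baseline_rate * (1.0 - relative_reduction)


# Hypothetical baseline false positive rate (NOT from the post).
baseline = 0.20

# An 81% relative reduction leaves 19% of the baseline rate.
print(round(reduced_rate(baseline, 0.81), 4))  # 0.038
```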