A challenger agent injects bugs into a repo and writes the ground truth to `bugs.json`. A separate reviewer agent then audits the repo without access to the ground truth, and an LLM matcher scores how the reviewer's findings map onto the injected bugs.
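To make the scoring step concrete, here's a minimal sketch of matching reviewer findings to ground-truth bugs and computing a detection rate. The `bugs.json` schema, the `llm_match` helper, and the greedy one-to-one assignment are all assumptions for illustration; the real matcher scores assignments with an LLM rather than the trivial heuristic stubbed in here.

```python
import json

def llm_match(bug: dict, finding: dict) -> bool:
    """Hypothetical stand-in for the LLM matcher: decide whether `finding`
    describes `bug`. Stubbed with a trivial file-path check so the sketch runs."""
    return bug["file"] == finding["file"]

def score_review(bugs_path: str, findings: list[dict]) -> float:
    """Fraction of injected bugs credited to at least one reviewer finding."""
    with open(bugs_path) as f:
        bugs = json.load(f)  # assumed: list of {"id", "file", "description"}

    detected: set[str] = set()
    for finding in findings:          # assumed: list of {"file", "description"}
        for bug in bugs:
            # Greedy one-to-one assignment: each finding credits at most one
            # not-yet-detected bug. The actual assignment policy may differ.
            if bug["id"] not in detected and llm_match(bug, finding):
                detected.add(bug["id"])
                break

    return len(detected) / len(bugs) if bugs else 0.0
```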
Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs.
Weighted detection rates: Claude 58.05%, Codex 37.84%, Gemini 27.81%.
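For context on how a "weighted" detection rate might be computed: the sketch below assumes each injected bug carries a per-bug weight (e.g. severity). The `weight` field and the weighting scheme itself are assumptions; the benchmark's actual weighting may differ.

```python
def weighted_detection(bugs: list[dict], detected_ids: set[str]) -> float:
    """Weighted share of bugs detected, assuming a hypothetical per-bug
    "weight" field (defaulting to 1.0 when absent)."""
    total = sum(b.get("weight", 1.0) for b in bugs)
    hit = sum(b.get("weight", 1.0) for b in bugs if b["id"] in detected_ids)
    return hit / total if total else 0.0
```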
LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing.
The full dataset is linked in the docs.