I did a few manual analyses, but found it non-trivial to compare across models due to differences in token caching and tool-use efficiency, so I wanted a tool for repeatable evaluations.
The goal was an OSS tool to gather data that helps answer questions like:
“Would Sonnet have solved most of the issues we gave Opus?” “How much would that have actually saved?” “What about OSS models like Kimi K2.5 or GLM-1?” “The vibes are off, did model performance just regress from last month?”
Right now the project is a bit medium-rare, but it works end-to-end. I’ve run it successfully against itself, and I’m waiting for my token limits to reset so I can add support for more languages and do a broader run. I’m already seeing a few cases where I could’ve used 5.4-mini instead of 5.4 for some parts of the implementation.
I’d love any feedback, criticism, and ideas. I’m especially interested in whether this is something you might pay for as a managed service, or whether you would contribute your private test cases to a shared-commons hold-out set to hold AI providers a bit more accountable.
https://repogauge.org hi@repogauge.org https://github.com/s1liconcow/repogauge
Thanks! David