7777777phil•13h ago
Cool project! I feel like I have been running my own mental, gut-feeling degradation tracker so far.
- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?
qwesr123•13h ago
Thanks! Daily confidence intervals are quite large and not super useful at the moment. Weekly aggregation is more sensitive. We're hoping to increase sample sizes, but it is quite expensive: it would be about $100-$150/day in API costs. We are using the Pro x20 subscription ($200/month).
Regarding more subtle degradation tracking, it is on the roadmap.
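To put rough numbers on the sample-size point, here is a minimal sketch comparing a 95% Wilson score interval for one day (50 tasks) against a week pooled together (350 tasks). The 70% pass rate is a made-up figure for illustration, not a result from the tracker, and Wilson is just one reasonable interval choice; nothing here says it is the method the project actually uses.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical numbers: a 70% pass rate, chosen only for illustration.
daily = wilson_interval(35, 50)      # one day, 50 tasks
weekly = wilson_interval(245, 350)   # seven days pooled, 350 tasks

for label, (lo, hi) in (("daily", daily), ("weekly", weekly)):
    print(f"{label}: {lo:.3f} to {hi:.3f} (width {hi - lo:.3f})")
```

Under these assumptions the daily interval comes out at roughly plus or minus 12 percentage points, and the weekly one at roughly plus or minus 5. That is why a single day's dip of a few points is hard to tell apart from noise while a weekly aggregate is more sensitive.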
7777777phil•12h ago
Cheers! Great work! Let me know if there's a way to follow the development.