I built this to answer a question for myself: which model should I actually route each type of task to? The harness runs 38 deterministic tests (CSV transforms, letter counting, modular arithmetic, regex extraction, code gen, multi-step instructions), costs $2.29 per full run across all 15 models, and all scoring is programmatic. No LLM judge for primary scores.
The surprising part was the QA process. My initial results showed Haiku beating Sonnet. That turned out to be a json_array scorer bug where max_score was set to expected_row_count instead of len(expected_rows), producing quality scores above 100%. A thin-space Unicode character (U+2009) in Gemini Flash responses broke three regex scorers silently. I ended up running 5 separate QA passes, each using a different model, and each pass found bugs the previous ones missed.
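For anyone curious what the max_score bug looked like, here's a minimal sketch. The names (`score_json_array`, `expected_row_count`) are illustrative, not the actual harness code; the point is that dividing by a count that can disagree with `len(expected_rows)` lets the ratio exceed 1.0:

```python
import json

def score_json_array(response: str, expected_rows: list[dict],
                     expected_row_count: int) -> float:
    """Score a model's JSON-array response against the expected rows."""
    try:
        rows = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    matched = sum(1 for row in rows if row in expected_rows)
    # BUG: max_score taken from a separately tracked count instead of the
    # actual expected data. When the two disagree, scores exceed 100%.
    max_score = expected_row_count
    # Fix: max_score = len(expected_rows)
    return matched / max_score
```

With five expected rows but a stale `expected_row_count` of 4, a fully correct response scores 125%, which is exactly the kind of impossible number that made Haiku "beat" Sonnet.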
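The thin-space failure is easy to reproduce. A sketch of the failure mode and one possible fix (the pattern here is made up for illustration; the real scorers differ):

```python
import re

# A scorer regex that expects an ASCII space after the colon.
pattern = re.compile(r"answer: (\d+)")

response = "answer:\u200942"  # U+2009 THIN SPACE, as emitted by Gemini Flash

# The match fails silently: no exception, just a zero score.
assert pattern.search(response) is None

# One fix: collapse all Unicode whitespace to ASCII spaces before scoring.
# Python's \s matches Unicode whitespace (including U+2009) on str patterns.
normalized = re.sub(r"\s+", " ", response)
match = pattern.search(normalized)
assert match is not None and match.group(1) == "42"
```

"Silently" is the key word: a scorer that returns 0 on a non-match is indistinguishable from a model that got the answer wrong, which is why this survived until a QA pass diffed raw responses against scores.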
Gemini 2.5 Flash scored 97.1% at $0.003/run with a 1.1s median response time. Opus scored 100% at $0.69/run. GPT-oss-20b scored 98.3% for $0. The cost spread across models that all score above 95% is genuinely hard to justify for most tasks.
Scoring code and raw results are in the post. Happy to answer questions about methodology.