We built an open-source CLI that generates code, runs tests, fixes failures, and gets an independent AI review — all before you see the output.
We started with a multi-model pipeline where different AI models handled different stages (architect, implement, refactor, verify). We assumed more models meant better code. Then we benchmarked it: 39% average quality score at $4.85 per run. A single model scored 94% at $0.36. Our pipeline was actively making things worse.
So we killed it and rebuilt around what developers actually do when they get AI-generated code: run it, test it, fix what breaks. The Loop generates code, runs pytest automatically, feeds failures back for targeted fixes, and repeats until all tests pass. Then an independent Arbiter (always a different model from the generator) reviews the final output.
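For the curious, the core control flow looks roughly like this. It's a simplified sketch, not the actual source: generate, fix, and arbiter_review are placeholders for the model calls, and real runs operate on a full project rather than a single file.

    import subprocess
    from pathlib import Path
    from typing import Callable

    def run_pytest(project: Path) -> tuple[bool, str]:
        # Run pytest against the generated project and capture the failure report.
        result = subprocess.run(
            ["pytest", str(project), "-q"], capture_output=True, text=True
        )
        return result.returncode == 0, result.stdout + result.stderr

    def loop_until_green(
        prompt: str,
        project: Path,
        generate: Callable[[str], str],        # generator model: prompt -> code
        fix: Callable[[str, str], str],        # generator model: (code, failures) -> patched code
        arbiter_review: Callable[[str], str],  # always a different model from the generator
        max_rounds: int = 5,
    ) -> str:
        code = generate(prompt)
        for _ in range(max_rounds):
            (project / "solution.py").write_text(code)
            passed, report = run_pytest(project)
            if passed:
                break
            code = fix(code, report)           # feed failures back for a targeted fix
        return arbiter_review(code)            # independent review of the final output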
Latest benchmark across three tasks (simple CLI, REST API, async multi-agent system):
Single Sonnet: 94% avg, 10 min dev time, $0.36
Single o3: 81% avg, 4 min dev time, $0.44
Multi-model: 88% avg, 9 min dev time, $5.59
CRTX Loop: 99% avg, 2 min dev time, $1.80
"Dev time" estimates how long a developer would spend debugging the output before it's production-ready. The Loop's hardest prompt produced 127 passing tests with zero failures.
When the Loop hits a test it can't fix, it escalates through three tiers: first it diagnoses the root cause before patching, then it strips context down to just the failing test and source file, and finally it brings in a different model for a second opinion. The goal is zero dev time on every run.
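Simplified, the escalation looks something like this. The names and prompts are illustrative, not the real API; ask and tests_pass stand in for the model call and the pytest run.

    from typing import Callable

    def escalated_fix(
        code: str,
        failing_test: str,
        report: str,
        ask: Callable[[str, str], str],     # (model, prompt) -> revised code
        tests_pass: Callable[[str], bool],  # runs pytest on the candidate code
        generator_model: str,
        second_opinion_model: str,          # a different model from the generator
    ) -> str:
        # Tier 1: diagnose the root cause before patching, instead of blindly re-prompting.
        diagnosis = ask(generator_model, f"Diagnose the root cause, no patch yet:\n{report}")
        candidate = ask(generator_model, f"Root cause: {diagnosis}\nApply a targeted fix:\n{code}")
        if tests_pass(candidate):
            return candidate

        # Tier 2: strip context down to just the failing test and the source file it exercises.
        minimal = f"Failing test:\n{failing_test}\n\nSource under test:\n{code}"
        candidate = ask(generator_model, f"Make this test pass:\n{minimal}")
        if tests_pass(candidate):
            return candidate

        # Tier 3: bring in a different model for a second opinion on the same minimal context.
        return ask(second_opinion_model, f"Make this test pass:\n{minimal}")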
Model-agnostic — works with Claude, GPT, o3, Gemini, Grok, DeepSeek. Bring your own API keys. Apache 2.0.
pip install crtx
https://github.com/CRTXAI/crtx
We published the benchmark tool too — run crtx benchmark --quick to reproduce our results with your own keys. Curious what scores people get on different providers and tasks.