We built an open-source CLI that generates code, runs tests, fixes failures, and gets an independent AI review — all before you see the output.
We started with a multi-model pipeline where different AI models handled different stages (architect, implement, refactor, verify). We assumed more models meant better code. Then we benchmarked it: 39% average quality score at $4.85 per run. A single model scored 94% at $0.36. Our pipeline was actively making things worse.
So we killed it and rebuilt around what developers actually do when they get AI-generated code: run it, test it, fix what breaks. The Loop generates code, runs pytest automatically, feeds failures back for targeted fixes, and repeats until all tests pass. Then an independent Arbiter (always a different model from the generator) reviews the final output.
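For the curious, the core control flow looks roughly like this. It's a simplified sketch, not the actual source: generate, fix, and arbiter_review are placeholders for the model calls, and real runs operate on a full project rather than a single file.

    import subprocess
    from pathlib import Path
    from typing import Callable

    def run_pytest(project: Path) -> tuple[bool, str]:
        # Run pytest against the generated project and capture the failure report.
        result = subprocess.run(
            ["pytest", str(project), "-q"], capture_output=True, text=True
        )
        return result.returncode == 0, result.stdout + result.stderr

    def loop_until_green(
        prompt: str,
        project: Path,
        generate: Callable[[str], str],        # generator model: prompt -> code
        fix: Callable[[str, str], str],        # generator model: (code, failures) -> patched code
        arbiter_review: Callable[[str], str],  # always a different model from the generator
        max_rounds: int = 5,
    ) -> str:
        code = generate(prompt)
        for _ in range(max_rounds):
            (project / "solution.py").write_text(code)
            passed, report = run_pytest(project)
            if passed:
                break
            code = fix(code, report)           # feed failures back for a targeted fix
        return arbiter_review(code)            # independent review of the final output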
Latest benchmark across three tasks (simple CLI, REST API, async multi-agent system):
Single Sonnet: 94% avg, 10 min dev time, $0.36
Single o3: 81% avg, 4 min dev time, $0.44
Multi-model: 88% avg, 9 min dev time, $5.59
CRTX Loop: 99% avg, 2 min dev time, $1.80
"Dev time" estimates how long a developer would spend debugging the output before it's production-ready. The Loop's hardest prompt produced 127 passing tests with zero failures.
When the Loop hits a test it can't fix, it escalates through three tiers: first it diagnoses the root cause before patching, then it strips context down to just the failing test and source file, and finally it brings in a different model for a second opinion. The goal is zero dev time on every run.
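Simplified, the escalation looks something like this. The names and prompts are illustrative, not the real API; ask and tests_pass stand in for the model call and the pytest run.

    from typing import Callable

    def escalated_fix(
        code: str,
        failing_test: str,
        report: str,
        ask: Callable[[str, str], str],     # (model, prompt) -> revised code
        tests_pass: Callable[[str], bool],  # runs pytest on the candidate code
        generator_model: str,
        second_opinion_model: str,          # a different model from the generator
    ) -> str:
        # Tier 1: diagnose the root cause before patching, instead of blindly re-prompting.
        diagnosis = ask(generator_model, f"Diagnose the root cause, no patch yet:\n{report}")
        candidate = ask(generator_model, f"Root cause: {diagnosis}\nApply a targeted fix:\n{code}")
        if tests_pass(candidate):
            return candidate

        # Tier 2: strip context down to just the failing test and the source file it exercises.
        minimal = f"Failing test:\n{failing_test}\n\nSource under test:\n{code}"
        candidate = ask(generator_model, f"Make this test pass:\n{minimal}")
        if tests_pass(candidate):
            return candidate

        # Tier 3: bring in a different model for a second opinion on the same minimal context.
        return ask(second_opinion_model, f"Make this test pass:\n{minimal}")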
Model-agnostic — works with Claude, GPT, o3, Gemini, Grok, DeepSeek. Bring your own API keys. Apache 2.0.
pip install crtx
https://github.com/CRTXAI/crtx
We published the benchmark tool too — run crtx benchmark --quick to reproduce our results with your own keys. Curious what scores people get on different providers and tasks.