Tested against GPT-5 and Claude 4.5 on 10 production specifications:
- Bauform: 10/10 pass all validation gates
- GPT-5: 0/10 (generates Streamlit UIs instead of REST APIs)
- Claude 4.5: 0/10 (same failure)
The problem: frontier models pattern-match "build a validator" → "create a Streamlit demo", even when the requirements explicitly ask for a production API.
tekodu•8h ago
Try it yourself:
- Live beta: https://bauform-beta.fly.dev/
- Benchmark with all results: https://github.com/tekodu/bauform-evals
- Quick API test:
    curl -X POST https://bauform-beta.fly.dev/v1/engine/generate \
      -H "Content-Type: application/json" \
      -d '{"spec": "CSV validator with REST API", "params": {}}'
- Analysis paper (under peer review): https://www.dropbox.com/scl/fi/vtmztpdkm0ns86qapxp5p/bauform...
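If you'd rather script the quick API test than use curl, here is a rough Python equivalent using only the standard library. The URL and JSON body are copied from the curl command above; the helper names `build_generate_request` and `send_request` are just illustrative, and since the response format isn't described here, the sketch returns the raw body instead of assuming a shape.

```python
import json
import urllib.request
from typing import Optional, Tuple

# URL and payload fields are taken from the curl command in the post.
GENERATE_URL = "https://bauform-beta.fly.dev/v1/engine/generate"


def build_generate_request(spec: str, params: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request equivalent to the curl call."""
    body = json.dumps({"spec": spec, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        GENERATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request, timeout: float = 60.0) -> Tuple[int, str]:
    """Send the request and return (HTTP status, raw response text)."""
    # The response schema is unknown here, so the body is returned undecoded as text.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `send_request(build_generate_request("CSV validator with REST API"))` should hit the same endpoint as the curl line; the 60-second timeout is an arbitrary guess, not a documented limit.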
We use 5-gate validation: functional, security, limits, latency, stability. Scoring is binary pass/fail: a spec passes only if it clears all five gates, because production code either works or it doesn't.
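To make the all-or-nothing scoring concrete, here is a minimal sketch of how five binary gates could be combined. Only the gate names come from the post; `GateResult`, `run_gates`, `overall_pass`, and the idea of passing each gate as a callable are hypothetical, and this is not a description of how Bauform itself implements the gates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Gate names come from the post, in the order listed there.
GATES = ("functional", "security", "limits", "latency", "stability")


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool


def run_gates(artifact: object, checks: Dict[str, Callable[[object], bool]]) -> List[GateResult]:
    """Run every gate against the artifact; a gate with no check counts as failed."""
    results = []
    for name in GATES:
        check = checks.get(name)
        passed = bool(check(artifact)) if check is not None else False
        results.append(GateResult(name, passed))
    return results


def overall_pass(results: List[GateResult]) -> bool:
    """Binary verdict: pass only if all five gates passed, with no partial credit."""
    return len(results) == len(GATES) and all(r.passed for r in results)
```

Treating a missing check as a failure is a deliberate choice in this sketch: it keeps a spec from passing just because one gate was never run.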
The results are cryptographically signed (Ed25519) and fully reproducible.
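A signature only verifies reproducibly if everyone signs and checks exactly the same bytes. Below is a small sketch of that step, assuming (the post does not confirm this) that each result is a JSON record serialized canonically before signing. The function names and any record fields are hypothetical, and the Ed25519 sign/verify call itself is left out because it needs a crypto library rather than the standard library.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so key order and whitespace never change the bytes."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def record_digest(record: dict) -> str:
    """SHA-256 hex digest of the canonical form, handy for comparing two runs by eye."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()
```

Two records with the same content but different key order produce identical bytes here, so a signature over `canonical_bytes(record)` would not depend on how the JSON happened to be written out.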
Happy to answer questions about the methodology or system architecture.