I kept hearing the same pattern: some teams are shipping 10-15 AI PRs daily without issues. Others tried once, broke production, and gave up entirely.
The difference wasn't what I expected: it wasn't about model choice or prompt engineering.
---
One team shipped an AI-generated PR that took down their checkout flow.
Their tests and CI passed, but the AI had "optimized" their payment processing by swapping `queueAnalyticsEvent()` for a direct `analytics.track()` call. The analytics call has a 2-second timeout, so when the analytics service is slow, payment processing stalls behind it and times out.
In prod, under real load, 95th percentile latency went from 200ms to 8 seconds. They ended up with 3 hours of downtime and $50k in lost revenue.
Everyone on that team knew you queue analytics events asynchronously, but that wasn't documented anywhere; it was just something they'd learned when analytics had an outage years ago.
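To make that concrete, the change looked roughly like this (a hypothetical sketch; apart from `queueAnalyticsEvent()` and `analytics.track()`, the names are illustrative, not their actual code):

```typescript
// Hypothetical sketch: apart from queueAnalyticsEvent() and analytics.track(),
// every name here is illustrative, not the team's actual code.

type Order = { id: string; amountCents: number };

declare const paymentProvider: { charge(order: Order): Promise<{ chargeId: string }> };
declare const analytics: { track(event: string, props: object): Promise<void> }; // client with a 2s timeout
declare function queueAnalyticsEvent(event: { type: string; orderId: string }): void; // fire-and-forget enqueue

// Before: the event is enqueued and sent out-of-band, so a slow analytics
// service never delays checkout.
async function handleCheckoutBefore(order: Order) {
  const charge = await paymentProvider.charge(order);
  queueAnalyticsEvent({ type: "checkout_completed", orderId: order.id });
  return charge;
}

// After the AI's "optimization": checkout now awaits the analytics client
// directly, so its 2-second timeout sits on the payment path whenever the
// analytics service is slow.
async function handleCheckoutAfter(order: Order) {
  const charge = await paymentProvider.charge(order);
  await analytics.track("checkout_completed", { orderId: order.id });
  return charge;
}
```

Nothing here fails a type check or a unit test; the regression only shows up when the analytics service is slow under real traffic.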
*The pattern*
Traditional CI/CD catches syntax errors, type mismatches, and test failures.
AIs rarely make those mistakes, and when they do, tests and lints catch them before they get committed. The real problem is that AI generates syntactically perfect code that violates your system's unwritten rules.
*The institutional knowledge problem*
Every codebase has landmines that live in engineers' heads, accumulated through incidents.
AI can't know these, so it falls into the traps, and it's on the code reviewer to spot them.
*What the successful teams do differently*
They write constraints in plain English, and AI enforces them semantically on every PR, e.g. "All routes in /billing/* must pass requireAuth and include an orgId claim."
AI reads your code, understands the call graph, and blocks merges that violate the rules.
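As a sketch of what "enforced semantically" means for the rule above (hypothetical Express-style routes; only `requireAuth` and the `orgId` claim come from the rule, the rest is illustrative):

```typescript
import express from "express";

// Hypothetical sketch: requireAuth and the orgId claim come from the rule
// above; the routes and helpers are illustrative.
declare const requireAuth: express.RequestHandler;
declare function listInvoices(orgId: string): unknown[];
declare function listAllInvoices(): unknown[];

const app = express();

// Satisfies the rule: the route passes requireAuth and scopes the query to
// the caller's orgId claim.
app.get("/billing/invoices", requireAuth, (req, res) => {
  const orgId = (req as any).auth.orgId; // claim attached by requireAuth
  res.json({ invoices: listInvoices(orgId) });
});

// Violates the rule: syntactically fine, type-checks, and unit tests can stay
// green, but it skips requireAuth and ignores orgId entirely.
app.get("/billing/export", (_req, res) => {
  res.json({ invoices: listAllInvoices() });
});
```

Both routes compile, pass lint, and can ship with green tests; only the second violates the written constraint, and that's the kind of merge a semantic check blocks.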
*The bottleneck*
When you're shipping 10x more code, validation becomes the constraint, not generation speed.
The teams shipping AI at scale aren't waiting for better models. They're using AI to validate AI-generated code against their institutional knowledge.
Bridging the gap between "AI that generates code" and "AI you can trust in production" isn't about model capabilities; it's about institutional knowledge.
If you're curious, you can check it out here: https://cubic.dev
Happy to answer any questions about what we've seen working (or not working) across different teams.