I formally verified AI-generated code. All 4 bugs were in the integration layer

https://brainflow.substack.com/p/formally-verifying-the-easy-part

1•hnipps•1h ago

Comments

hnipps•1h ago

Author here. I picked this problem because it was assigned to me in Jira. Energy usage attribution logic for smart EV charging incentives in Python/Django. Pure arithmetic, clear postconditions. Best possible case for formal verification. If it couldn't prove its value here, it couldn't prove it anywhere.

So I built a Claude Code plugin called Crosscheck that uses Dafny (backed by Z3). Five proof obligations, all discharged on the first attempt. The math was trivially correct.

Then I tried to integrate it and everything fell apart. Four bugs, none expressible in any verification language. Decimal precision: the function computed to 6dp, the Django model stored 3dp. Enum coercion: `session.type.value` returned `int 0` instead of `"SMART"`. A test factory that didn't set `transaction_period`, so the function silently returned early. Test passed, code did nothing. And a custom TestCase base class that blocked Decimal comparisons entirely. Two were mismatches between components that individually worked fine. Two were test theatre: the exact failure mode I'd built verification to escape.
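To make the two component mismatches concrete, here is a minimal sketch and not the actual project code. `split_usage`, `store`, and `SessionType` are hypothetical names, the rounding mode is an assumption, and `store` only stands in for a round trip through a 3dp column. Each piece is fine on its own; the invariant only breaks where they meet.

```python
from decimal import Decimal, ROUND_HALF_UP
from enum import Enum

SIX_DP = Decimal("0.000001")
THREE_DP = Decimal("0.001")


def split_usage(total: Decimal, fraction: Decimal) -> tuple[Decimal, Decimal]:
    """Hypothetical attribution: split total kWh into two periods at 6dp."""
    period1 = (total * fraction).quantize(SIX_DP, rounding=ROUND_HALF_UP)
    period2 = total - period1  # exact, so period1 + period2 == total
    return period1, period2


def store(value: Decimal) -> Decimal:
    """Stand-in for saving to a 3dp column (assumed to round half up)."""
    return value.quantize(THREE_DP, rounding=ROUND_HALF_UP)


total = Decimal("1.001")
period1, period2 = split_usage(total, Decimal("0.5"))
in_memory_ok = period1 + period2 == total                 # True at 6dp
stored_sum = store(period1) + store(period2)              # 0.501 + 0.501
after_storage_ok = stored_sum == store(total)             # False: 1.002 != 1.001


class SessionType(Enum):
    """Hypothetical enum: the member name is "SMART", the value is an int."""
    SMART = 0
    STANDARD = 1


coerced = SessionType.SMART.value   # 0, not "SMART"
label = SessionType.SMART.name      # "SMART"
```

The postcondition proven over exact arithmetic says nothing about what survives the storage boundary, which is why a prover cannot see either mismatch.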

The postconditions turned out to be useful in a way I didn't expect: as property test oracles at the integration boundary. A Hypothesis test derived from the postcondition `period1 + period2 == total` catches the silent skip immediately and bypasses the broken assertEqual. The spec bridges the gap: proven in Dafny, enforced in Python.
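A rough sketch of the oracle idea, under stated assumptions: it uses a hand-rolled random loop from the standard library rather than Hypothesis so it runs without dependencies, and `Session`, `attribute_usage`, and `find_counterexample` are hypothetical stand-ins, not the real Django model or Crosscheck output. The postcondition is checked with a plain `==` on Decimals, so no TestCase assertion helper is involved.

```python
import random
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional


@dataclass
class Session:
    """Hypothetical stand-in for the charging session model."""
    total_kwh: Decimal
    transaction_period: Optional[str] = None
    period1_kwh: Decimal = Decimal("0")
    period2_kwh: Decimal = Decimal("0")


def attribute_usage(session: Session, fraction: Decimal) -> None:
    """Hypothetical attribution with the silent early return described above."""
    if session.transaction_period is None:
        return  # does nothing, raises nothing
    session.period1_kwh = (session.total_kwh * fraction).quantize(Decimal("0.000001"))
    session.period2_kwh = session.total_kwh - session.period1_kwh


def postcondition_holds(session: Session) -> bool:
    """The proven postcondition, reused as a runtime oracle."""
    return session.period1_kwh + session.period2_kwh == session.total_kwh


def find_counterexample(
    make_session: Callable[[Decimal], Session], trials: int = 200, seed: int = 0
) -> Optional[Session]:
    """Run random inputs through the function; return the first failing session."""
    rng = random.Random(seed)
    for _ in range(trials):
        total = Decimal(rng.randint(1, 10**9)) / Decimal(10**6)  # positive, up to 6dp
        fraction = Decimal(rng.randint(0, 100)) / Decimal(100)
        session = make_session(total)
        attribute_usage(session, fraction)
        if not postcondition_holds(session):
            return session
    return None


# Factory that forgets transaction_period: the oracle flags it.
broken = find_counterexample(lambda t: Session(total_kwh=t))
# Factory that sets it: no counterexample in 200 trials.
fixed = find_counterexample(lambda t: Session(total_kwh=t, transaction_period="2025-01"))
```

With the incomplete factory, the early return leaves both periods at zero, so any positive total violates the postcondition on the first trial. An example-based test that only checks "no exception raised" would pass in the same situation.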

I also ran static analysis across 14 open-source codebases (2.5M lines). Pure, verifiable functions made up 22-27% of the code, remarkably stable across Python and TypeScript. Kleppmann predicted (https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...) that AI will make formal verification go mainstream. I think he's right about the provers, but the mainstream ceiling is about a quarter of the codebase.

Source code is at https://github.com/nicholls-inc/claude-code-marketplace/tree... if you want to see the Dafny integration.

Happy to answer questions about the static analysis methodology, or the contract graph verifier idea at the end.
