Author here. I picked this problem because it was assigned to me in Jira. Energy usage attribution logic for smart EV charging incentives in Python/Django. Pure arithmetic, clear postconditions. Best possible case for formal verification. If it couldn't prove its value here, it couldn't prove it anywhere.
So I built a Claude Code plugin called Crosscheck that uses Dafny (backed by Z3). Five proof obligations, all discharged on the first attempt. The math was trivially correct.
Then I tried to integrate it and everything fell apart. Four bugs, none expressible in any verification language. Decimal precision: the function computed to 6dp, the Django model stored 3dp. Enum coercion: `session.type.value` returned `int 0` instead of `"SMART"`. A test factory that didn't set `transaction_period`, so the function silently returned early. Test passed, code did nothing. And a custom TestCase base class that blocked Decimal comparisons entirely. Two were mismatches between components that individually worked fine. Two were test theatre: the exact failure mode I'd built verification to escape.
The postconditions turned out to be useful in a way I didn't expect: as property test oracles at the integration boundary. A Hypothesis test from `period1 + period2 == total` catches the silent skip immediately and bypasses the broken assertEqual. The spec bridges the gap: proven in Dafny, enforced in Python.
I also ran static analysis across 14 open-source codebases (2.5M lines). Pure, verifiable functions: 22-27% of code, remarkably stable across Python and TypeScript. Kleppmann predicted (https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...) that AI will make formal verification go mainstream. I think he's right about the provers, but the mainstream ceiling is a quarter of the codebase.
hnipps•1h ago
So I built a Claude Code plugin called Crosscheck that uses Dafny (backed by Z3). Five proof obligations, all discharged on the first attempt. The math was trivially correct.
Then I tried to integrate it and everything fell apart. Four bugs, none expressible in any verification language. Decimal precision: the function computed to 6dp, the Django model stored 3dp. Enum coercion: `session.type.value` returned `int 0` instead of `"SMART"`. A test factory that didn't set `transaction_period`, so the function silently returned early. Test passed, code did nothing. And a custom TestCase base class that blocked Decimal comparisons entirely. Two were mismatches between components that individually worked fine. Two were test theatre: the exact failure mode I'd built verification to escape.
The postconditions turned out to be useful in a way I didn't expect: as property test oracles at the integration boundary. A Hypothesis test from `period1 + period2 == total` catches the silent skip immediately and bypasses the broken assertEqual. The spec bridges the gap: proven in Dafny, enforced in Python.
I also ran static analysis across 14 open-source codebases (2.5M lines). Pure, verifiable functions: 22-27% of code, remarkably stable across Python and TypeScript. Kleppmann predicted (https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...) that AI will make formal verification go mainstream. I think he's right about the provers, but the mainstream ceiling is a quarter of the codebase.
Source code is at https://github.com/nicholls-inc/claude-code-marketplace/tree... if you want to see the Dafny integration.
Happy to answer questions about the static analysis methodology, or the contract graph verifier idea at the end.