AI will make formal verification go mainstream

https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html

6•mau•1mo ago

Comments

mvr123456•1mo ago

This is going to happen and is real stuff we could be working towards with the tools that we already have. No need for AGI vaporware, no waiting around for the perfect agentic playground. Not even necessarily a big requirement for excellent reasoning. Just using LLMs for what they are actually good at, i.e. fuzzy translators, stylistic filters, and compilers.

Gradual-typing was practice and hints. Like gradual typing, gradual spec'ing could be an iterative and kind of parallel annotated representation, ignored by the main runtime unless called for, and ignored by developers that aren't interested in it. But when there's enough of it to hit some kind of critical mass, then it's suddenly very powerful and lots of very interesting stuff is possible

mutkach•1mo ago

I certainly hope so.

I wonder, what is the actual blocker right now? I'd assume that LLMs are still not very good with specifications and verifcation languages? Anyone tried Datalog, TLA+, etc. with LLMs? I suppose that Gemini was specifically trained on Lean. Or at least some IMO-finetuned fork of it. Anyhow, there's probably a large Lean dataset collected somewhere in Deepmind servers, but that's not certification applicable necessarily, I think?

> AI also creates a need to formally verify more software: rather than having humans review AI-generated code, I’d much rather have the AI prove to me that the code it has generated is correct.

At RL stage LLMs could game the training*, proving easier invariants then actually expected (the proof is correct and possibly short - means positive reward). It would take additional care it to set it up right.

* I mean, if you set it to generate a code AND a proof to it.

lukeasrodgers•1mo ago

I would like this to be true, but am a bit skeptical.

I am what the article calls an "industrial software engineer" and I work on "low- to medium-assurance" projects, but have used various formal methods (alloy and TLA+) in my work to prevent and discover bugs.

I've experimented with using LLMs to generate both Alloy and TLA+ a couple times over the past years, and the problems I see are:

- They have gotten better over the last few years, but still can only produce useful results in the hands of someone who is moderately competent. Becoming moderately competent requires many hours of investment in these tools, and you will lose much of this competence if you don't keep it up. For example, I can still read TLA+ and Pluscal but can't write them without lots of referring to the docs because I only write them like once or twice a year.

- They suffer even more from GIGO than other aspects of software development. If you can't really rigorously define your problem you will get a bad model/output that only gives you false confidence. A large part of the value of doing formal methods is building the muscle for thinking rigorously. Hillel Wayne says this in several places, that doing enough TLA+ (e.g.) work gives you a much better innate sense for where there will be race conditions.

- There will still be a cultural and technical problems with integrating formal methods, and their artifacts, into the rest of your codebase and team. For example, how do you prevent drift? Will you have a CI automation that uses an LLM to detect when the spec has diverged from the code?

I'm not saying it is impossible that this will happen, and I would love to be wrong, but the general tendency I see with LLM use is to make software developers less intimately familiar with their tools, and less invested in deeply understanding their code. That bodes ill for formal methods even more than regular programming.

mutkach•1mo ago

What would you suggest as a reference problem (a benchmark of sorts) to try to play with formal methods for someone with just a bit of formal verification background but not in the field of software verification? Can you suggest some helpful materials?

I've come across TLA+ multiple times, but it seems it was more targeted towards distributed systems (Lamport being the creator, that makes sense). Is it correct, that it would be useless in other domains?

The Rise of Spec Driven Development

The first good Raspberry Pi Laptop

Seas to Rise Around the World – But Not in Greenland

Will Future Generations Think We're Gross?

Kernel Key Retention Service

State Department will delete Xitter posts from before Trump returned to office

Show HN: Verifiable server roundtrip demo for a decision interruption system

Impl Rust – Avro IDL Tool in Rust via Antlr

Stories from 25 Years of Software Development

minikeyvalue

Neomacs: GPU-accelerated Emacs with inline video, WebKit, and terminal via wgpu

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

How I grow my X presence?

What's the cost of the most expensive Super Bowl ad slot?

What if you just did a startup instead?

Hacking up your own shell completion (2020)

Show HN: Gorse 0.5 – Open-source recommender system with visual workflow editor

GLM-OCR: Accurate × Fast × Comprehensive

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

Show HN: AboutMyProject – A public log for developer proof-of-work

Expertise, AI and Work of Future [video]

So Long to Cheap Books You Could Fit in Your Pocket

PID Controller

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

Kubernetes MCP Server

I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife

What were the first animals? The fierce sponge–jelly battle that just won't end

Sidestepping Evaluation Awareness and Anticipating Misalignment

OldMapsOnline

What It's Like to Be a Worm