On the semantic gap between the original software and its representation in the ITP, program extraction (as in Rocq) probably deserves some discussion: the software is written natively in the ITP, and you then have to prove the extraction itself sound. The MetaRocq project did this for Rocq, for example.
For the "how far down the stack" problem, there are some efforts from https://deepspec.org/, but it's an inherently difficult problem and often gets less love than lab-environment projects.
I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g. for high-consequence systems, the hard part of compiler correctness is already handled by CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.
If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.
There are plenty of formal specifications that cannot be compiled, not even by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?
Plenty of real programming languages are ambiguous in ways that surely disqualify them as formal specification languages. A trivial example in C: decrementing an unsigned int variable that holds 0. The subtraction is guaranteed to wrap around, but the value you get (UINT_MAX) is implementation-defined, since the C standard doesn't fix the width of unsigned int.
> There are plenty of formal specifications that cannot be compiled, not even by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?
By proving that the code satisfies the formal spec. Getting from a formal spec to a program (in an imperative programming language, say) can be broken down into several stages of 'refinement'. A snippet from [0]:
> properties that are proved at the abstract level are maintained through refinement, hence are guaranteed to be satisfied also by later refinements.
[0] https://www.southampton.ac.uk/~tsh2n14/publications/chapters...
And yes, if you can prove that the implementation is correct with respect to the formal spec, you are good, and it doesn't really matter how you got the implementation.
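To make that concrete, here is a minimal Lean-style sketch (a toy example of my own, not from [0]): the spec only constrains the result, the implementation is produced however you like, and a machine-checked theorem ties them together.

    -- Toy spec: the result is an upper bound of both inputs and is equal to one of them.
    def MaxSpec (a b r : Nat) : Prop :=
      a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

    -- Some implementation, obtained however you like (refinement, by hand, an LLM, ...).
    def myMax (a b : Nat) : Nat :=
      if a ≤ b then b else a

    -- The machine-checked link between the two; how the code was produced doesn't matter.
    theorem myMax_meets_spec (a b : Nat) : MaxSpec a b (myMax a b) := by
      unfold MaxSpec myMax
      by_cases h : a ≤ b <;> simp [h] <;> omega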
Refinement is one approach; personally, I just do interactive theorem proving.
Not really, on both counts.
Firstly, they're not really programming languages in the usual sense, in that they don't describe the sequence of instructions the computer must follow. Functional programming languages are considered 'declarative', but they're still explicit about the computational work to be done. A formal spec doesn't do this; it just expresses the intended constraints on the correspondence between input and output (very roughly speaking).
Secondly, regarding the "we don't know how to compile it" aspect: constraint programming and SMT solvers essentially do this, although they're not a practical way to build most software.
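A toy Lean sketch of the first point (my own example): a sorting "spec" only relates input and output; nothing in it describes the computational work to be done.

    -- The output is ordered and has the same element counts as the input.
    -- This constrains the input/output relation, but is not a program.
    def Sorted : List Nat → Prop
      | [] => True
      | [_] => True
      | a :: b :: rest => a ≤ b ∧ Sorted (b :: rest)

    def IsSortingOf (input output : List Nat) : Prop :=
      Sorted output ∧ ∀ x, input.count x = output.count x

    -- Any number of programs (insertion sort, merge sort, ...) may satisfy this spec;
    -- the spec itself is none of them.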
So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
Even without the Lean technical details, a lot of math theorems just don't mean anything to most people. For example, I have no idea what the Navier-Stokes theorem is saying. So, I would not be able to tell you if a Lean encoding of the theorem is correct. (Unless, of course, it is trivially broken, such as by assuming False.)
Not at all. Theorem crafting is hard. What's easy is standing up and proving a mathematical strawman which may or may not have any relation to the original problem.
Writing a statement you don't understand then later getting a proof from an LLM doesn't seem all that useless to me; in my mind, it could still be useful as exploration. Worst case scenario: you encoded a tautology and the LLM gave you a trivial proof. Best case scenario: the proposition ends up being a lemma for something you want to prove.
I do think there is a kernel of truth in what you've stated: if the user does not actually understand the statement of a proposition, then the proof is not very useful to them, since they don't know what the statement's truthiness implies. As someone who used to do mathematics, I still find immense value in vibe-coding away mundane computations, similar to what Lean's `simp` tactic already does, but much more broadly.
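For instance, a trivial made-up goal that automation already discharges:

    -- `simp` closes routine algebraic bookkeeping with no hand-written proof
    example (n : Nat) : (n + 0) * 1 = n := by simp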
False => P
Then you vibe code a proof of this theorem. Then you get excited that you’ve proven P.
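In Lean terms, the trap looks like this (a minimal sketch; P is an arbitrary placeholder):

    -- trivially provable for any P whatsoever, so it says nothing about P itself
    theorem looks_impressive (P : Prop) : False → P :=
      fun h => h.elim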
Some of the X discussion that prompted the OP was quite close to this. There are screenshots on X of Lean code that doesn’t compile, with Lean being blamed.
In any case, the worst case scenario is having a vacuously true statement. I picked tautologies as an example since that and statements about the empty set are the first things that come to mind.
I don't know if that's possible but it seems like that's where the value would be.
But this is still double black diamond expert territory - it remains a challenging thing to identify ways to state and refine those properties to enable the proof. But it's getting easier by the year and the LLMs are getting more useful as part of it.
https://github.com/verus-lang/verus
(1) You would have to state the consistency and conditional-liveness properties directly, of course, as Verus doesn't understand "what Paxos does". That would look like TLA+-style statements: "at some time in the future, a command submitted now will be executed", etc.
1. https://chatgpt.com/share/696b7f8a-9760-8006-a1b5-89ffd7c5d2...
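As a rough sketch of the kind of statement meant in (1) (made-up names, trace-style Lean rather than Verus or TLA+ syntax):

    -- conditional liveness, phrased over execution traces indexed by time:
    -- every command submitted at time t is executed at some time t' ≥ t
    def eventuallyExecuted (Cmd : Type) (submitted executed : Cmd → Nat → Prop) : Prop :=
      ∀ (c : Cmd) (t : Nat), submitted c t → ∃ t', t ≤ t' ∧ executed c t'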
Your comment is certainly correct, and I agree that the various LLM implementations probably cannot actually partition attempts to find proofs into any given logical system.
My comment was more tongue-in-cheek than your response took it to be; I phrased it awkwardly, obviously. I was humorously hoping that, at some fundamental level, perhaps by unintentionally taking advantage of a previously unknown foundational rule of computer science, LLMs as 'thinking' algorithms simply could not understand or utilize non-constructive logical means to formulate a proof.
As I said, I did not think this is actually what's going on with GPT not being able to actually, or convincingly, 'understand' the law of the excluded middle. It was more of a backhanded insult at LLMs in particular, and at those salespeople who want to talk about them as thinking, reasoning, semi-conscious algorithmic 'beings'.
We have been building GFQL, an open-source graph query language that can run at the compute tier, separate from whatever existing storage system (Splunk, SQL, cs, ...), making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it is a bit unusual inside, too: we are making Cypher-style queries run with columnar operations (purely vectorized for both CPU and GPU mode via a multi-target dataframe abstraction layer), but that takes a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, so this sounded perfect for Alloy's small-scope hypothesis!
So we are running a natural experiment on which does more bug finding:
- asking Claude Code to analyze new code or recently found issues and using that for unit-test amplification
- asking Claude Code to use the Alloy model checker to prove out our space of (query, graph) for key methods
- asking Claude Code to port a large Cypher conformance test suite
So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a '5 whys' root-cause analysis on why some bug got through and then have it do test amplification around several categories of potential error.
We found the conformance suite to be 'class imbalanced', focusing on a couple of features we are adding next, so the jury is still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to better reflect the kinds of bugs we actually find, but 0 was disappointing.
If anyone is into this kind of thing, happy to discuss/collab!
The question is, of course, what portion of programs are much easier than the worst case (to the point of making verification tractable). That is not known, but the results are not necessarily encouraging. Programs that have been deductively verified in proof assistants were not only very small, and not only written with proofs in mind, but were also restricted to using simple and less efficient algorithms to make the proofs doable. They tend to require between 10x and 1000x lines of proof per line of code.
(an old blog post of mine links to some important results: https://pron.github.io/posts/correctness-and-complexity)
There is a belief that programs that people write and more-or-less work should be tractably provable, as that's what allowed writing them in the first place (i.e. the authors must have had some vague proof of correctness in mind when writing them). I don't think this argument is true (https://pron.github.io/posts/people-dont-write-programs), and we use formal methods precisely when we want to close the gap between working more-or-less and definitely always working.
But even if verifying some program is tractable, it could still take a long time between iterations. Since it could reasonably take an LLM no less than a month to prove a correct program, there's no point in stopping it before then. So a "practical" approach could be: write a program, try proving it for a month, and if you haven't finished in a month, try again. That, of course, could mean waiting, say, six months before deciding whether or not it's likely to ultimately succeed. Nevertheless, I expect there will be cases where writing the program and the proof would have taken a team of humans 800 years, and an LLM could do it in 80.
ratmice•3w ago
Mis-defining concepts can be extremely subtle. If you look at the allsome quantifier (https://dwheeler.com/essays/allsome.html) you'll see that these problems predate AI, and I struggle to see how natural language is going to help in cases like the "All Martians" one, where the confusion may be over whether Martians exist at all. Something relatively implicit.
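For what it's worth, the trap is easy to reproduce in a proof assistant; a made-up Lean sketch:

    inductive Martian : Type   -- no constructors: there are no Martians

    -- "All Martians are green" is vacuously provable, whatever `green` means...
    theorem all_martians_green (green : Martian → Prop) : ∀ m, green m :=
      fun m => nomatch m

    -- ...while the "allsome" reading, which also asserts that a green Martian exists,
    -- i.e. (∀ m, green m) ∧ (∃ m, green m), is not provable here.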
smarx007•3w ago
To be... more precise?
On a more serious note, I cannot recommend "Exactly: How Precision Engineers Created the Modern World" by Winchester enough. While the book talks mostly about precision in mechanical engineering, it made me appreciate _precision_ itself to a greater degree.
lindenr•3w ago
For many statements I expect it's not possible to retain the exact meaning of the formal-language sentence without the natural language becoming at least as complex, and if you don't retain meaning exactly then you're vulnerable to the kind of thing the article warns about.
guenthert•3w ago
Perhaps we must not rely on it and should find a way to make sure that it cannot fail, but I'd like to point out that these are two different problems, and it seems to me that the current crop of so-called AIs is pretty good at distilling excerpts. Perhaps that's the easier problem to solve?
Rochus•3w ago
> What does one gain besides familiarity by translation back into a more ambiguous language?
You gain intent verification. Formal languages are precise about implementation, but they are often opaque about intent. A formal specification can be "precisely wrong". E.g. you can write a perfectly precise Event-B spec that says "When the pedestrian button is pressed, the traffic light turns Green for cars"; the formalism is unambiguous, the logic is sound, the proof holds, but the intent is fatally flawed. Translating this back to natural language ("The system ensures that pressing the button turns the car light green") allows a human to instantly spot the error.
> All Martians are green
Modern LLMs are actually excellent at explicating these edge cases during back-translation if prompted correctly. If the formal spec allows vacuous truth, the back-translation agent can be instructed to explicitly flag existential assumptions. E.g. "For every Martian (assuming at least one exists), the color is Green", or "If there are no Martians, this rule is automatically satisfied". You are not translating back to casual speech; you are translating back to structured, explicit natural language that highlights exactly these kinds of edge cases.
Rochus•3w ago
We don't back-translate the proof steps (the thousands of intermediate logical derivations). That would indeed be verbose and useless.
We back-translate the specification: the Invariants, Guards, and Events.
For a traffic light system, we don't need the LLM to explain the 50 steps of predicate logic that prove inv3 holds. We just need it to translate inv3 itself:
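For illustration (made-up names; the real invariant depends on the model):

    -- illustrative only: a typical safety invariant for a traffic-light model
    -- inv3: cars and pedestrians never both have a go signal
    def inv3 (carsGo pedsGo : Prop) : Prop := ¬ (carsGo ∧ pedsGo)
    -- back-translation: "At no point do both cars and pedestrians have a green light."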
This isn't verbose; it's the exact concise summary of the system's safety rules. The 'verbosity' of handling edge cases (like the 'allsome' example) only applies when the specification itself relies on subtle edge cases, in which case being verbose is exactly what you want to prevent a hidden bug.
ratmice•2w ago
I just feel like the street-light example is an extremely small, free-standing example. Most things that I feel are worth the effort of proving end up huge. Formal verification languages have forever been denigrated for being overly rigid and too verbose. I feel like translations into natural language can only increase that, if they are accurate.
One thing I wish is that this whole discussion were less intertwined with AI. The semantic gap existed before AI and will be run into again without AI. People have been accidentally proving the wrong thing true or false forever, and will never stop, with or without AI help.
At the very least we can agree that the problem exists. While I'm skeptical of natural language as being anything but the problem we ran away from, at least you're trying something and exploring the problem space, and that can only be cheered.
viraptor•3w ago
It's a solution only if the translation is proven correct. If not, you're in the same place as you started.
Rochus•2w ago