Designing Predictable LLM-Verifier Systems for Formal Method Guarantee

https://arxiv.org/abs/2512.02080

59•PaulHoule•1mo ago

Comments

brantmv•1mo ago

Maybe I'm wrong, but it looks like the authors did not actually have any LLMs write or verify any code for their experiments. Instead, their experiments consist of simulating the simplified Markov chain model itself. They simulated their simple Markov chain and checked if the theorem's predictions matched empirical statistics. This amounts to a test not of their model, but of basic Markov chain theory.

Did I misread or miss something?
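For readers who want to see what that kind of check looks like, here is a minimal sketch. It is not the paper's chain: it assumes a hypothetical five-state layout (four sequential stages plus an absorbing "verified" state) where each attempt advances one stage with probability `p` and otherwise stays put. Under that assumption the expected number of steps is `n_stages / p`, and the simulation only confirms that standard result.

```python
import random


def simulate_steps(p, n_stages=4, rng=None):
    """Count steps until the absorbing 'verified' state is reached.

    Assumed toy model: each step advances one stage with probability p,
    otherwise the chain stays in the current stage.
    """
    rng = rng or random.Random()
    stage, steps = 0, 0
    while stage < n_stages:
        steps += 1
        if rng.random() < p:
            stage += 1
    return steps


def expected_steps(p, n_stages=4):
    """Theoretical mean: a sum of n_stages geometric waiting times."""
    return n_stages / p


def empirical_mean(p, n_stages=4, runs=20000, seed=0):
    """Average step count over many simulated runs (seeded for repeatability)."""
    rng = random.Random(seed)
    total = sum(simulate_steps(p, n_stages, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    p = 0.5
    print("theory:   ", expected_steps(p))
    print("simulated:", empirical_mean(p))
```

Agreement between the two numbers says nothing about whether real LLM-verifier pipelines behave like this chain, which is the point of the comment above.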

brantmv•1mo ago

Also, the mathematical content here is pretty thin. Their main theorem has nothing to do with LLMs directly. It's a theorem about a five-state Markov chain, and the proof follows from standard Markov chain theory.

For those reasons, the grandiose name "LLM-Verifier Convergence Theorem" does not sit well with me.

mapontosevenths•1mo ago

This line made me pause:

"We prove that for any non-zero stage success probability, the system reaches the verified state almost surely"

What's the point if it's still stochastic?

IanCal•1mo ago

Hash collisions are possible but can be provably so rare that they’re not a relevant concern.

jaggederest•1mo ago

"Almost surely" means "happens with probability 1", which in infinite sample spaces doesn't mean that there are no other outcomes, only that those outcomes have probability 0.

So like, imagine that you had some finite list of numbers, and you were picking a random real number uniformly from the interval [0, 1] - because the interval contains uncountably many points, any finite set has probability 0, but that doesn't mean those outcomes don't exist.

https://en.wikipedia.org/wiki/Almost_surely

mapontosevenths•1mo ago

Thank you. That makes this a pretty big deal, doesn't it?

The ability to deterministically identify that code eventually reaches a halting state implies that we can use these stochastic tools to generate deterministic outcomes reliably in the future, doesn't it?

jaggederest•1mo ago

Well, reliably but still with a chance of failure, in the same way that a program can be provably correct and still run into real-world issues like being killed. But yes, I would say that "almost surely" is a pretty large jump from "more than likely" (50%+1), which is where I'd say LLM output generally lives these days.

MiniMax42•1mo ago

> a chance of failure

Well, technically, no chance of failure. The chance of failure is absolute zero. Not close to zero, absolute zero. There will be no failure if the assumptions of the model are correct.

The real catch here is in the assumptions.

How long do you have before you need to have a solution? An hour, a year, a century? Too bad, almost sure convergence only provides a guarantee if you wait an infinite amount of time.

And then there's the question of the probability space you assume. (The sigma algebra.) Which things do you assume to have probability zero from the start and is that realistic?
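To put a number on the waiting-time point, here is a small sketch under a simplifying assumption that is not taken from the paper: a single stage whose attempts succeed independently with the same probability `p`. Then the chance of still being unverified after `n` attempts is `(1 - p) ** n`, which is positive for every finite `n` and only reaches zero in the limit. The helper name `attempts_needed` and the tolerance `delta` are hypothetical.

```python
import math


def failure_probability(p, n):
    """Probability that n independent attempts all fail (assumed i.i.d. model)."""
    return (1.0 - p) ** n


def attempts_needed(p, delta):
    """Smallest n with (1 - p)**n <= delta, for 0 < p <= 1 and 0 < delta < 1."""
    if not (0.0 < p <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need 0 < p <= 1 and 0 < delta < 1")
    if p == 1.0:
        return 1  # the first attempt always succeeds
    n = math.ceil(math.log(delta) / math.log(1.0 - p))
    # Guard against floating-point rounding at the boundary.
    while failure_probability(p, n) > delta:
        n += 1
    return n


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(p, attempts_needed(p, 1e-6))
```

The required budget grows roughly like `1 / p`, so a tiny but non-zero success probability still satisfies the almost-sure guarantee while demanding an impractical number of attempts.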

mapontosevenths•1mo ago

> How long do you have before you need to have a solution? An hour, a year, a century? Too bad, almost sure convergence only provides a guarantee if you wait an infinite amount of time.

Thanks for this. I was actually just thinking "this can't actually work, it would mean P vs NP is solved." Of course, this explains why it doesn't mean that.

werf456•1mo ago

You can check out this recent paper on scalable formal verification of LLMs, "BEAVER: An Efficient Deterministic LLM Verifier": https://arxiv.org/abs/2512.05439

lebron72•1mo ago

This paper looks pretty groundbreaking. The ability to verify LLMs at scale (e.g., 70B) on real-world tasks like math reasoning and code security is extremely impressive and impactful.
