Designing Predictable LLM-Verifier Systems for Formal Method Guarantee

https://arxiv.org/abs/2512.02080

59•PaulHoule•1mo ago

Comments

brantmv•1mo ago

Maybe I'm wrong, but it looks like the authors did not actually have any LLMs write or verify any code for their experiments. Instead, their experiments consist of simulating the simplified Markov chain model itself. They simulated their simple Markov chain and checked if the theorem's predictions matched empirical statistics. This amounts to a test not of their model, but of basic Markov chain theory.

Did I misread or miss something?
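For concreteness, here is a minimal sketch of the kind of experiment described above: simulate a small absorbing Markov chain and compare the empirical mean absorption time with the textbook prediction. The chain itself is an assumption made for illustration (four sequential stages plus an absorbing "verified" state, each stage retried until it succeeds with the same probability `p`); it is not taken from the paper, and the names `steps_to_verified` and `mean_steps` are hypothetical.

```python
import random

# Hypothetical toy model (an assumption, not the paper's exact chain):
# states 0..3 are pipeline stages, state 4 is the absorbing "verified" state.
N_STAGES = 4


def steps_to_verified(p, rng):
    """Count attempts until the chain is absorbed in the verified state."""
    state, steps = 0, 0
    while state < N_STAGES:
        steps += 1
        if rng.random() < p:  # the current stage succeeds with probability p
            state += 1        # on failure, the same stage is retried
    return steps


def mean_steps(p, runs=20000, seed=0):
    """Empirical mean absorption time over independent simulated runs."""
    rng = random.Random(seed)
    return sum(steps_to_verified(p, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    p = 0.5
    print("empirical mean steps:", mean_steps(p))
    # Each stage takes a geometric number of attempts with mean 1/p.
    print("theoretical mean    :", N_STAGES / p)
```

The two numbers agreeing only confirms that a sum of geometric waiting times behaves as the theory says. It says nothing about whether a real LLM-verifier loop has a constant, independent per-attempt success probability.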

brantmv•1mo ago

Also, the mathematical content here is pretty thin. Their main theorem has nothing to do with LLMs directly. It's a theorem about a five-state Markov chain, and the proof follows from standard Markov chain theory.

For those reasons, the grandiose name "LLM-Verifier Convergence Theorem" does not sit well with me.

mapontosevenths•1mo ago

This line made me pause:

"We prove that for any non-zero stage success probability, the system reaches the verified state almost surely"

What's the point if it's still stochastic?

IanCal•1mo ago

Hash collisions are possible but can be provably so rare that they’re not a relevant concern.

jaggederest•1mo ago

"Almost surely" means "happens with probability 1", which in contexts with infinitely many outcomes doesn't mean that there aren't other outcomes, only that they have probability 0.

So, imagine that you had some finite list of numbers, and you were picking a random real number from 0 to infinity (from a continuous distribution). Any finite set has probability 0 of being hit, but that doesn't mean those outcomes don't exist.

https://en.wikipedia.org/wiki/Almost_surely
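A worked one-stage version of the quoted claim, as a sketch under a stated assumption: each attempt succeeds independently with the same probability \delta > 0, and T (a symbol introduced here for illustration) is the number of attempts until the first success.

```latex
\[
P(T > n) = (1-\delta)^{n} \xrightarrow[n \to \infty]{} 0
\quad\Longrightarrow\quad
P(T < \infty) = 1 - \lim_{n \to \infty} P(T > n) = 1 .
\]
```

The outcome "every attempt fails" is still in the sample space. It just has probability 0, which is exactly the "almost surely" situation.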

mapontosevenths•1mo ago

Thank you. That makes this a pretty big deal, doesn't it?

The ability to deterministically identify that code eventually reaches a halting state implies that we can use these stochastic tools to generate deterministic outcomes reliably in the future, doesn't it?

jaggederest•1mo ago

Well, reliably, but still with a chance of failure, in the same way that a provably correct program can still run into real-world issues like being killed. But yes, I would say that "almost surely" is a pretty large jump from "more than likely" (just over 50%), which is where I'd say LLM output generally lives these days.

MiniMax42•1mo ago

> a chance of failure

Well, technically, no chance of failure. The chance of failure is absolute zero. Not close to zero, absolute zero. There will be no failure if the assumptions of the model are correct.

The real catch here is in the assumptions.

How long do you have before you need to have a solution? An hour, a year, a century? Too bad, almost sure convergence only provides a guarantee if you wait an infinite amount of time.

And then there's the question of the probability space you assume. (The sigma algebra.) Which things do you assume to have probability zero from the start and is that realistic?
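To put a number on the "how long" question above, here is a minimal sketch assuming a single stage that is retried independently with a fixed success probability `p` (an assumption for illustration, as is the function name `attempts_needed`). After n attempts the chance of still not having succeeded is (1 - p)^n, so one can solve for the smallest n that pushes it below a tolerance `eps`.

```python
import math


def attempts_needed(p, eps):
    """Smallest n with (1 - p)**n <= eps for one independently retried stage.

    p   -- assumed per-attempt success probability, 0 < p <= 1
    eps -- acceptable probability of still having failed after n attempts
    """
    if not (0 < p <= 1) or not (0 < eps < 1):
        raise ValueError("need 0 < p <= 1 and 0 < eps < 1")
    if p == 1:
        return 1  # the first attempt always succeeds
    # (1 - p)**n <= eps  <=>  n >= log(eps) / log(1 - p)
    return math.ceil(math.log(eps) / math.log(1 - p))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(p, attempts_needed(p, 1e-6))
```

For any finite budget and any p < 1 the failure probability stays strictly positive. It only reaches zero in the limit, and the required budget grows quickly as `p` shrinks.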

mapontosevenths•1mo ago

> How long do you have before you need to have a solution? An hour, a year, a century? Too bad, almost sure convergence only provides a guarantee if you wait an infinite amount of time.

Thanks for this. I was actually just thinking "this can't actually work, it would mean P vs NP is solved." Of course, this explains why it doesn't mean that.

werf456•1mo ago

You can check out this recent paper on scalable formal verification of LLMs, "BEAVER: An Efficient Deterministic LLM Verifier": https://arxiv.org/abs/2512.05439

lebron72•1mo ago

This paper looks pretty groundbreaking. The ability to verify LLMs at scale (e.g., 70B) on real-world tasks like math reasoning and code security is extremely impressive and impactful.
