Ultimately there are still going to be bugs. For this reason, among others, you'll still need the generation wrapped in a retry.
When the grammar of the target language is better defined, as with SMT (https://arxiv.org/abs/2505.20047), we are able to do this with open source LLMs.
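The retry wrapper I have in mind is roughly this shape (a minimal sketch using the z3 Python bindings; generate_smt2 is a hypothetical stand-in for whatever LLM call produces the SMT-LIB text, not a real API):

    from z3 import Solver, parse_smt2_string, Z3Exception

    def solve_with_retry(prompt, generate_smt2, max_attempts=3):
        # generate_smt2 is hypothetical: an LLM call that returns SMT-LIB text.
        last_error = None
        for _ in range(max_attempts):
            smt_text = generate_smt2(prompt, error=last_error)  # feed the last error back
            try:
                s = Solver()
                s.add(parse_smt2_string(smt_text))  # raises on malformed SMT-LIB
                return s.check()
            except Z3Exception as e:
                last_error = str(e)  # grammar/typing bug, try again
        raise RuntimeError(f"no well-formed SMT after {max_attempts} attempts: {last_error}")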
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
Your comment would be fine without that bit.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is the ability to literally read the logical statements being evaluated to an answer, and to run the solver to see whether the goal holds or not.
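To make that concrete, the generated program is roughly of this shape (a toy sketch using the z3 Python bindings, with made-up facts and a made-up rule, not the project's actual output):

    from z3 import Bools, Implies, Not, Solver, unsat

    human, mortal = Bools("socrates_human socrates_mortal")

    s = Solver()
    s.add(human)                   # known fact
    s.add(Implies(human, mortal))  # inference rule
    s.add(Not(mortal))             # negated goal

    # If facts + rules + NOT(goal) is unsatisfiable, the goal follows.
    print("goal holds" if s.check() == unsat else "goal not entailed")

Nothing stops the model from adding one more assertion that quietly changes the verdict, which is where the worry below comes in.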
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at building logical models or keeps introducing unsound facts and rules. Sadly, the evaluation is a bit thin on how this happens and what causes it to fall short.
E.g. in https://arxiv.org/pdf/2505.20047, Tab 1, we compare performance on text-only vs SMT-only reasoning. o3-mini does pretty well at mirroring its text reasoning in its SMT, whereas Gemini Flash 2.0 does not.
An illustration of this can be seen in Figs 14 and 15 on page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model of your domain (e.g. from a PDF policy document), verify it for correctness up front, and then during answer generation you only check whether the Q/A pairs from the LLM comply with that policy, using a solver with guarantees.
This lets them give you a 99%+ soundness guarantee: if the service says a Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.
https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
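Schematically, the check works like this (just a toy illustration with the z3 Python bindings, not the AWS API; the policy and predicates are made up):

    from z3 import Bool, Implies, Not, Solver, unsat

    is_full_time = Bool("is_full_time")
    gets_benefits = Bool("gets_benefits")
    policy = Implies(is_full_time, gets_benefits)  # built once from the policy doc, reviewed

    def answer_complies(claim):
        # The Q/A pair complies if the policy entails the claim,
        # i.e. policy AND NOT(claim) is unsatisfiable.
        s = Solver()
        s.add(policy, Not(claim))
        return s.check() == unsat

    print(answer_complies(Implies(is_full_time, gets_benefits)))  # True
    print(answer_complies(Not(gets_benefits)))                    # False, not entailed

The LLM only has to translate each Q/A pair into a claim over the reviewed vocabulary; the entailment check itself comes with solver-level guarantees.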
I've also used agents to synthesize, formalize, and criticize domain knowledge. It's obviously not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolic reasoning, plus agents-as-a-judge, is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Although that work is not public, you can play with the generally available product here [1]!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Some LLMs are more consistent between their text and their SMT than others (Tab 1, Figs 14 and 15).
You can also do uncertainty quantification with selective verification to reduce the risk, e.g. reported as the Area Under the Risk-Coverage Curve in Tab 4.
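As a rough sketch of what that metric measures (made-up confidence scores, not code from the paper): rank answers by confidence, keep only the most confident fraction, and track the error rate (risk) at each coverage level; AURC averages that curve.

    def area_under_risk_coverage(confidences, correct):
        # Rank by descending confidence, then accumulate risk at each coverage level.
        ranked = sorted(zip(confidences, correct), key=lambda x: -x[0])
        risks, errors = [], 0
        for i, (_, ok) in enumerate(ranked, start=1):
            errors += 0 if ok else 1
            risks.append(errors / i)    # risk when covering the top i answers
        return sum(risks) / len(risks)  # mean risk across coverage levels

    print(area_under_risk_coverage([0.9, 0.8, 0.6, 0.4], [True, True, False, True]))

Selective verification then just means refusing to trust the verdict below some confidence threshold, trading coverage for lower risk.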
LASR•1h ago
My team has been prototyping something very similar, encoding business operations policies in Lean. We have some internal knowledge bases (Google Docs / wiki pages) that we first convert to Lean using LLMs.
Then we run the Lean checker to verify consistency.
When a wiki page is changed, the process runs again, so it's essentially a linter for our processes.
Can't say it has moved beyond the prototyping stage though, since the Lean conversion still requires some engineers to look through it.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
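As a toy illustration of the kind of thing the consistency check can surface (made-up predicates and rules, not our actual encoding): two wiki rules that contradict each other make False provable, which is exactly what the "linter" flags.

    -- Hypothetical predicates extracted from two wiki pages.
    axiom Employee : Type
    axiom remoteAllowed : Employee → Prop
    axiom mustBeOnsite : Employee → Prop

    -- Page A: everyone may work remotely.
    axiom ruleA : ∀ e, remoteAllowed e
    -- Page B: everyone must be onsite, and onsite excludes remote.
    axiom ruleB : ∀ e, mustBeOnsite e
    axiom onsiteExcludesRemote : ∀ e, mustBeOnsite e → ¬ remoteAllowed e

    -- If the rules conflict, False is provable; the check fails on this.
    theorem policies_inconsistent (e : Employee) : False :=
      onsiteExcludesRemote e (ruleB e) (ruleA e)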
barthelomew•50m ago
If you ever feel like discussing more details, happy to chat!
viraptor•48m ago