But I suppose the bigger goal remains improving their language model, and this was an experiment born of that effort. These works are symbiotic; the original DeepSeekMath paper produced GRPO, which eventually formed the backbone of their R1 model: https://arxiv.org/abs/2402.03300
I don't think an LLM is a great fit for pure RLVR.
[1] IIRC, AlphaProof is a bit like this. But I bet that either there's a whole lot of effort on this sort of thing in the major AI labs, or else there's some good reason to expect it not to work that I haven't thought of. (Maybe just the "bitter lesson", I guess.)
It would doubtless be challenging to get such a system to find large difficult proofs, because it's not so easy to tell what's making progress and what isn't. Maybe you need LLM #3, which again might or might not be the same as the other two LLMs, to assess what parts of the attempt so far seem like they're useful, and scrub the rest from the context or at least stash it somewhere less visible.
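Concretely, that loop might look something like this; a minimal sketch, where `llm_prove`, `llm_critique`, and `lean_check` are hypothetical stand-ins for whatever model and verifier APIs one actually has, not anything described in the paper:

```python
from typing import Callable

# All three callables are placeholders, passed in by the caller.
LlmProve = Callable[[str, list[str]], str]                  # LLM #1
LlmCritique = Callable[[str, list[str]], list[tuple[str, bool]]]  # LLM #3
LeanCheck = Callable[[str, str], bool]                      # formal verifier

def prove(goal: str, llm_prove: LlmProve, llm_critique: LlmCritique,
          lean_check: LeanCheck, max_rounds: int = 10) -> str | None:
    context: list[str] = []  # proof fragments currently judged useful
    for _ in range(max_rounds):
        attempt = llm_prove(goal, context)      # propose next steps
        if lean_check(goal, attempt):           # verifier accepts: done
            return attempt
        # Triage: decide which fragments look like progress, and
        # scrub the rest from the visible context.
        verdicts = llm_critique(goal, context + [attempt])
        context = [frag for frag, useful in verdicts if useful]
    return None
```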
It is, of course, also challenging for human mathematicians to find large difficult proofs, and one of the reasons is that it's not so easy to tell what's making progress and what isn't. Another major reason, though, is that sometimes you need a genuinely new idea, and so far LLMs aren't particularly good at coming up with those. But a lot of new-enough-ideas[2] are things like "try a version of this technique that worked well in an apparently unrelated field", which is the kind of thing LLMs aren't so bad at.
[2] Also a lot of the new-enough-ideas that mathematicians get really happy about. One of the cool things about mathematics is the way that superficially-unrelated things can turn out to share some of their structure. If LLMs get good at finding that sort of thing but never manage any deeper creativity than that, it could still be enough to produce things that human mathematicians find beautiful.
Plus there isn't a lot of training data in Lean.
Most gains come from training on data that's already out there, not from the RLVR part, which just amplifies it a bit.
Turing machines are also deterministic, but there is no algorithm that can decide whether any given Turing machine halts. What you're asking for is a solution to the Halting Problem.
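The classic diagonal argument for why can even be sketched as code: assume a decider `halts` exists, then build a program that does the opposite of whatever the decider predicts about it. A purely illustrative Python sketch:

```python
def halts(program, arg) -> bool:
    """Hypothetical decider: True iff program(arg) halts.
    The argument below shows no such function can exist."""
    raise NotImplementedError

def paradox(p):
    # Do the opposite of whatever the decider predicts about p(p).
    if halts(p, p):
        while True:  # loop forever
            pass
    return  # halt immediately

# Now ask: does paradox(paradox) halt?
# If halts says yes, paradox loops forever; if it says no, paradox
# halts immediately. Either way halts is wrong, so it cannot exist.
```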
That's the first problem. The second problem is that any such system that didn't support natural language would require a formal language of some sort, and then you would have to convince every mathematician to write their proofs in your language so it can be checked. All attempts at this have failed to gain much traction, although Lean has gotten pretty far.
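For a sense of what writing in such a formal language looks like, here's a trivial Lean 4 example (real research-level formalizations run to thousands of lines):

```lean
-- A trivial machine-checkable statement in Lean 4, proved by
-- applying a lemma from the core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```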
It's like chess: verifying that a given position is checkmate is easy, but finding the moves that get you there is hard.
Of course, one can try all possible moves and see what happens. Similar to chess AIs based on search methods (e.g. minimax), there are proof search methods; see the related work section of the paper.
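As a rough illustration of the analogy, here's what a generic best-first proof search might look like; `apply_tactic` and `score` are hypothetical stand-ins for a real prover backend and a heuristic or learned value function, not the paper's method:

```python
import heapq
from typing import Callable

def best_first_search(
    goals: tuple[str, ...],                    # open proof obligations
    tactics: list[str],
    apply_tactic: Callable[[tuple[str, ...], str], tuple[str, ...] | None],
    score: Callable[[tuple[str, ...]], float],  # lower = more promising
    max_expansions: int = 10_000,
) -> list[str] | None:
    frontier = [(score(goals), goals, [])]      # (priority, goals, proof so far)
    seen = {goals}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, state, proof = heapq.heappop(frontier)
        if not state:                           # no goals left: proof found
            return proof
        for t in tactics:
            nxt = apply_tactic(state, t)        # None if the tactic fails
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), nxt, proof + [t]))
    return None
```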
What's the use case for a system like this?
I suspect it's also because there isn't a lot of data to train on.
We've seen absolutely ridiculous progress in model capability over the past year (which is also quite terrifying).
The approach shifts from "result-oriented" to "process-oriented" verification, which is particularly important for theorem proving, where rigorous step-by-step derivation matters more than just numerical answers.
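A toy contrast between the two kinds of verification, assuming a hypothetical per-step checker `verify_step` (this illustrates the general idea, not the paper's actual reward):

```python
from typing import Callable

def outcome_reward(final_answer: str, reference: str) -> float:
    # Result-oriented: only the final answer is checked.
    return 1.0 if final_answer == reference else 0.0

def process_reward(steps: list[str],
                   verify_step: Callable[[str], bool]) -> float:
    # Process-oriented: every derivation step is checked, and
    # credit stops at the first flawed step.
    valid = 0
    for step in steps:
        if not verify_step(step):
            break
        valid += 1
    return valid / max(len(steps), 1)
```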
For another thing, the 2024 Putnam problems are in their RL data.
Also, it's very unclear how performance on these competitions, whose problems are designed to have clear-cut answers and to be solvable by (well-prepared) humans in about an hour, will translate to anything else.
Why do you think that the 2024 Putnam problems that they used to test were in the training data?
/? "Art of Problem Solving" Putnam https://www.google.com/search?q=%22Art+of+Problem+Solving%22...
From p.3 of the PDF:
> Curating Cold Start RL Data: We constructed our initial training data through the following process:
> 1. We crawled problems from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs, totaling 17,503 problems.
Putnam solutions can be found in multiple places online: https://kskedlaya.org/putnam-archive/ and https://artofproblemsolving.com/community/c3249_putnam. These could have appeared in the pretraining of the base LLM (DeepSeek-V3.2-Exp) or as problems in the RL training set; the paper gives no further detail on which problems were selected from AoPS, and as the second link shows, the Putnam problems are there.
They reference https://artofproblemsolving.com/community/c13_contest_collec... as the source of their scrape, and the Putnam problems are on that page under 'Undergraduate Contests'.
It would also make it easier to keep engineers in the loop: they would decompose the problem into an appropriate set of formally specified functions, and could chip in when necessary to complete difficult proofs or redefine the functions.
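For example, an engineer might pin down a function and its contract in Lean and leave the proof obligation for the system (or a colleague) to discharge; a toy sketch with invented names:

```lean
-- Toy sketch: the engineer writes the function and its contract;
-- the proof is left as an open obligation (`sorry`) for the system
-- or a human to fill in. All names here are invented.
def clamp (lo hi x : Nat) : Nat :=
  max lo (min hi x)

theorem clamp_ge_lo (lo hi x : Nat) : lo ≤ clamp lo hi x := by
  sorry
```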