Benchmarks in Leipzig

https://arxiv.org/abs/2606.05818
45•root-parent•1h ago

Comments

root-parent•1h ago
"...Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers... We present the resulting collection of 100 questions....We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs....we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive..."
rabidvermin•1h ago
> mathematics questions with known answers...

... that are therefore liable to be in the training data?

fc417fc802•1h ago
I had the same thought, because even if the exact solution doesn't appear, there's a notable difference between performing a literature search and solving something de novo. But I think perhaps this benchmark wasn't meant to exclude the former, and the point may have been to test whether the model can accurately interpret and synthesize relevant output for research-level mathematical problems at all.
tossandthrow•54m ago
I can recommend reading section 2 of the paper.

The goal was not to define unsolved problems.

But as such, the problems are also not previously published problems.

This seems quite reasonable IMHO.

christianstump•12m ago
I think you are underestimating the complexity of such problems. A PhD in the exact field of research would need days to weeks to understand what the problem means and how to solve it. This is far beyond "throwing standard techniques" at a problem. (But, I keep emphasizing this, it is also far away from solving research mathematics.)
fc417fc802•7m ago
What did I say that led you to believe I was underestimating the complexity? I don't believe I commented on it at all.
andy99•58m ago
“In the training data” isn’t really relevant for a modern LLM. The better question is whether the problems are solvable using known techniques that have been fine-tuned in.

A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.

Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training on using that technique, seeing a solution isn’t too surprising.

criemen•57m ago
Section 2.2 (Submission workflow), stage W2, partially deals with this:

> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.

So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
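
As a rough sketch of the W2 rule quoted above (the function and parameter names are hypothetical, not taken from the paper, and the verdicts are assumed to come from the LLM judge):

```python
def may_proceed(judged_correct: list[bool], max_correct: int = 3) -> bool:
    """Return True if at most `max_correct` models were judged correct."""
    return sum(judged_correct) <= max_correct


# One verdict per project-active model (five in the quoted workflow).
print(may_proceed([True, True, True, False, False]))  # True: 3 of 5 correct
print(may_proceed([True, True, True, True, False]))   # False: 4 of 5 correct
```

A question that every model answers immediately would fail this check, which is the sense in which "trivially contained" questions are filtered out.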

zerobees•1h ago
I know that people with strong feelings one way or the other will comment here, but note that this is specifically about problems with known answers that can be found in public data (and thus, likely in the training material).

This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exam questions you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.

The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."

lightningspirit•38m ago
I think most of the value LLMs provide comes from connecting the dots between unsolved questions and patterns or structures that have already been demonstrated, which accelerates research.

Now, reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.

christianstump•22m ago
Let me also add: there is zero chance of the problems being included in the training data. The results are quite impressive: leading experts struggled to write questions with well-defined unique answers on existing research that the models were not able to solve.

This should not be interpreted as "AI can solve mathematics": the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.

But it is still impressive and not what we expected. I expected that we would end up with 20-40 questions that no current publicly available model can solve.

Towaway69•57m ago
As long as it's not conscious, we're safe.
qsort•54m ago
These are the results from the website they link in the paper:

https://math.sciencebench.ai/benchmarks

I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.

It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?

tux3•41m ago
If you're trying to compare what the models are good at, it's important to note that the different models did not run with the same settings. In one case they also retried with GPT until it answered all the problems, but did not retry with the other models.

GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.

christianstump•27m ago
The difference between GPT and Gemini concerning the "retry until..." can almost be ignored. I did rerun GPT a few times, but still far fewer times than the number of questions Gemini was not able to answer at all.
christianstump•29m ago
I am the leader of the study and the author of the benchmark paper. Let me add: the problems are much harder than any exam question.

Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.

Nonetheless, these are questions about existing research, though much closer to a question given to a second-year PhD student than to an exam question.

christianstump•28m ago
But it still remains far away from mathematics research. Solving any of the problems would not result in a new research paper.
sajithdilshan•9m ago
It would have been more interesting if LLMs were tested with questions whose direct solutions are not publicly available (and so not in the training data). In that case I wonder how much hallucination would occur, or whether the models would try to connect the dots from what's publicly available and come up with a direct solution.
puttycat•8m ago
Hopefully they password-protect the datasets:

https://arxiv.org/abs/2305.10160

spuz•4m ago
As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 × 100 runs:

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems it still had a failure rate of 78%. That means if you chose this model to answer your homework question, there is a good chance you would fail.
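
The percentages above can be reproduced with a few lines. This is only a sketch that reuses the counts quoted in this comment; the helper name `precision` is made up for illustration:

```python
def precision(correct: int, answered: int) -> float:
    """Share of answered questions that were answered correctly."""
    return correct / answered


# Counts as quoted above from Table 3 (20 runs x 100 questions per model).
runs = {
    "GPT 5.5": {"answered": 1389, "correct": 1043},
    "Opus 4.7": {"answered": 1306, "correct": 294},
}

for model, counts in runs.items():
    p = precision(counts["correct"], counts["answered"])
    print(f"{model}: {p:.1%} of answers correct, {1 - p:.1%} wrong")
# GPT 5.5: 75.1% of answers correct, 24.9% wrong
# Opus 4.7: 22.5% of answers correct, 77.5% wrong
```

Note that this divides by the number of questions answered, not by all 2000 attempts, so it measures how often a given answer can be trusted rather than how many problems were solved.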

Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.

Moving beyond fork() + exec()

https://lwn.net/SubscriberLink/1076018/16f01bbbb8e0d1f0/
25•jwilk•47m ago•12 comments

How LLMs work

https://www.0xkato.xyz/how-llms-actually-work/
605•0xkato•2d ago•173 comments

Pokemon Emerald Ported to WebAssembly (100k FPS)

https://pokeemerald.com/
79•tripplyons•4h ago•23 comments

Building Rust Procedural Macros from the Grounds Up

https://www.learnix-os.com/ch02-03-implementing-the-bitfields-proc-macro.html
28•Sagi21805•5d ago•1 comments

The new bibliomaniacs

https://engelsbergideas.com/notebook/the-new-bibliomaniacs/
32•RickJWagner•3h ago•22 comments

Tribute to Jiro Yamada, Automotive Artist (1960-2025) [video]

https://www.youtube.com/watch?v=rJ2gQ5Md60U
13•NaOH•18h ago•0 comments

S&P 500 rejects SpaceX, also blocking entry for OpenAI and Anthropic

https://arstechnica.com/tech-policy/2026/06/sp-500-blocks-fast-spacex-entry-wont-waive-rule-for-u...
949•maltalex•10h ago•343 comments

The intracies of modern camera lens repair (2024)

https://salvagedcircuitry.com/sigma-45mm.html
208•transistor-man•14h ago•71 comments

New method turns ocean water into drinking water, without waste

https://www.rochester.edu/newscenter/what-is-desalination-definition-ocean-water-704732/
449•speckx•1d ago•185 comments

Azure Linux Desktop

https://www.boxofcables.dev/azure-linux-desktop-a-build-2026-mashup-of-wslc-winui-reactor-and-azu...
54•haydenbarnes•7h ago•30 comments

Mbodi AI (YC P25) Is Hiring Founding Machine Learning Engineer (Robotics)

https://www.ycombinator.com/companies/mbodi-ai/jobs/WYAcNkX-founding-machine-learning-engineer
1•chitianhao•3h ago

Ask HN: What was your "oh shit" moment with GenAI?

436•andrehacker•1d ago•772 comments

pg_durable: Microsoft open sources in-database durable execution

https://github.com/microsoft/pg_durable
437•coffeemug•23h ago•98 comments

Social Cache Busting

https://www.autodidacts.io/social-cache-busting/
82•surprisetalk•4d ago•26 comments

Pre-Modern Armies for Worldbuilders, Part I: Why They Fight

https://acoup.blog/2026/06/05/collections-pre-modern-armies-for-worldbuilders-part-i-why-they-fight/
125•gostsamo•11h ago•39 comments

Astronauts told to return to ISS after sheltering over air leak repairs

https://www.bbc.com/news/live/c4g44ew3g1kt
412•janpot•1d ago•254 comments

Show HN: Soft Body Jiggle Physics

https://github.com/xloveee/jiggle-physics
7•vesperance•4d ago•4 comments

US House lawmakers release draft bill to prohibit state AI rules

https://www.reuters.com/business/us-house-lawmakers-release-draft-bill-regulate-ai-2026-06-04/
23•1vuio0pswjnm7•1h ago•6 comments

Did Claude increase bugs in rsync?

https://alexispurslane.github.io/rsync-analysis/
476•logicprog•1d ago•479 comments

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gem...
372•theanonymousone•23h ago•113 comments

Mouseless – keyboard-driven control of macOS/Linux/Windows

https://mouseless.click
556•riddley•3d ago•225 comments

The Smart TV in Your Living Room Is a Node in the AI Scraping Economy

https://blog.includesecurity.com/2026/06/the-smart-tv-in-your-livingroom-is-a-node-in-the-aiscrap...
128•nikcub•6h ago•40 comments

HISE – Toolkit for building VST plugins

https://hise.dev
15•hyperific•2d ago•2 comments

The back cover of C++: The Language raises questions not answered by front cover

https://devblogs.microsoft.com/oldnewthing/20260605-01/?p=112391
128•paulmooreparks•11h ago•43 comments

Meta Keeps Delaying the Release of Its New AI Model to Developers

https://www.wsj.com/tech/ai/meta-keeps-delaying-the-release-of-its-new-ai-model-to-developers-f85...
30•mekpro•3h ago•10 comments

My Agent Skill for Test-Driven Development

https://www.saturnci.com/my-agent-skill-for-test-driven-development.html
209•laxmena•2d ago•92 comments

Ten Years of Franz

https://meetfranz.com/blog/ten-years-of-franz
54•tosh•3d ago•28 comments

Gov.uk has replaced Stripe with Dutch provider Adyen

https://www.theregister.com/public-sector/2026/06/04/govuk-goes-dutch-on-payments-as-it-dumps-str...
526•toomuchtodo•22h ago•200 comments

Introduction – Rust for Python Programmers

https://microsoft.github.io/RustTraining/python-book/
50•linhns•4h ago•20 comments