Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy before we're ready.
Considering OpenAI can't currently analyse cutting-edge scientific issues and provide real paper sources for them, I wouldn't trust it to do actual research beyond generating matplotlib code.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
Here's an example, Problem 5:
Let a_1, a_2, …, a_n be distinct positive integers and let M = max_{1 ≤ i < j ≤ n} (a_i + a_j)(a_j − a_i).
Find the maximum number of pairs (i, j) with 1 ≤ i < j ≤ n for which (a_i + a_j)(a_j − a_i) = M.
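For a feel of what this level means, here's a quick brute force over small cases (a throwaway sketch of mine; the function names and the search bound `limit` are arbitrary), which is the kind of experimentation contestants start from:

    from itertools import combinations, permutations

    def count_max_pairs(a):
        # Count pairs (i, j), i < j, attaining M = max (a_i + a_j)(a_j - a_i).
        vals = [(a[i] + a[j]) * (a[j] - a[i])
                for i, j in combinations(range(len(a)), 2)]
        return vals.count(max(vals))

    def max_pairs_over(n, limit):
        # Exhaust all orderings of all n-element subsets of {1..limit}.
        return max(count_max_pairs(p)
                   for s in combinations(range(1, limit + 1), n)
                   for p in permutations(s))

    for n in range(2, 6):
        print(n, max_pairs_over(n, 10))  # small-case data only, not a proof

Gathering small-case data like this is the easy part; the contest demands a proof of the general answer.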
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
This is a ridiculous understatement of the difficulty of getting gold at the IMO.
It's really hard.
See my other comment. I was voted the best at math in my entire high school by my teachers and completed the first two years of college classes while still in high school. I've tried IMO problems for fun, and I'm very happy if I get one right. I'd be infinitely satisfied to score perfectly on 3 out of 6 problems, and that's nowhere near gold.
Many of them are also questions whose eventual proofs or solutions require only a very high-level understanding of basic principles, plus the ability to combine simple concepts to solve complex problems. But when I say very high, I mean impossibly high for the average person.
I'd wager the majority of Math graduates from universities would struggle to answer most IMO questions.
Either you are unfamiliar with the International Math Olympiad or you’re trying to be misleading.
Calling these problems high school/early university maths is a ridiculous characterization.
In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.
He thought there was an 8% chance of this happening.
Eliezer Yudkowsky said "at least 16%".
Source:
https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...
We may certainly hope Eliezer's other predictions don't prove so well-calibrated.
The better the prior predictive power of the human forecasters, the stronger the posterior evidence of acceleration in LLM progress (math capability).
Here we are supposing that the increase in training data is not the main explanatory factor.
This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information.
(1) Poor prior predictive capability in humans implies the result provides no information.
(2) Good prior predictive capability in humans implies there is acceleration in the math capabilities of LLMs.
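To make that concrete with the numbers quoted upthread (30% prior on hard takeoff, 50% posterior given an IMO gold by 2025, 8% on the gold itself), the implied likelihood ratio is a two-line Bayes calculation; back-of-envelope only:

    # Figures quoted above: P(takeoff) = 0.30, P(takeoff | gold) = 0.50, P(gold) = 0.08
    p_t, p_t_given_g, p_g = 0.30, 0.50, 0.08

    p_g_given_t = p_t_given_g * p_g / p_t                    # P(gold | takeoff)    ~ 0.133
    p_g_given_not_t = (p_g - p_t_given_g * p_g) / (1 - p_t)  # P(gold | no takeoff) ~ 0.057

    print(p_g_given_t / p_g_given_not_t)  # likelihood ratio ~ 2.33

So under Christiano's own numbers, observing the gold is roughly 2.3:1 evidence in favour of the hard-takeoff hypothesis.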
Second, I'd be happy to test it on open math conjectures, or by attempting to re-prove recent math results.
For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.
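For anyone who hasn't seen what "formalise" means in practice: it's writing mathematics in a language a proof checker verifies mechanically. A trivial Lean 4 example (the real formalisation efforts are this, scaled up to research mathematics):

    -- Toy Lean 4 theorem: every step is machine-checked.
    theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b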
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.
You could counter that it's meaningless to remove the time constraints. But we are comparing humans with OpenAI here, and it is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at multiple words per second.
Addendum: Actually, I'm not sure the probability of solving one of these in a week is much better than in 6 hours, because they are kind of random questions. But I agree with some parts of your post, tbf.
Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.
Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
(Not to take away from the result, which I'm really impressed by!)
https://deepmind.google/discover/blog/alphago-zero-starting-...
They are all outstanding mathematicians, but IMO-type questions are not something mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
My second degree is in mathematics. Not only can I probably not do these, but they likely aren't useful to my work, so I don't actually care.
I'm not sure an LLM could replace the mathematical side of my work (modelling), mostly because it's applied: people don't know what they're asking for, what's possible, or how to do it, and all the problems turn out to be quite simple really.
E.g. here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.
Suggesting they should be given the benefit of the doubt is dishonest at this point.
How do you know?
It's interesting that it didn't solve the problem that was by far the hardest for humans, too. China, the #1 team, got only 21/42 points on it. In most other teams, nobody solved it.
Edit: Fixed P4 -> P3. Thanks.
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
According to the twitter thread, the model was not given access to tools.
That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.
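To make the distinction concrete, here's a toy sketch of sample-and-verify; everything in it (the guessing "model", the checker, the toy problem of finding a number whose square ends in 76) is a made-up stand-in, not OpenAI's pipeline:

    import random

    def generate(rng: random.Random) -> int:
        # Stand-in for one sampled model attempt (here: a random guess).
        return rng.randrange(1, 1000)

    def verify(candidate: int) -> bool:
        # Stand-in for an automated checker run by the pipeline itself.
        # If a human does this filtering instead, the result means far less.
        return (candidate * candidate) % 100 == 76

    def best_of_n(n: int, seed: int = 0):
        # Sample up to n attempts; return the first that verifies.
        rng = random.Random(seed)
        for _ in range(n):
            c = generate(rng)
            if verify(c):
                return c
        return None

    print(best_of_n(10_000))  # throughput scales with compute, as noted above

Whether the verifier was automated, and how strong it was, is exactly the detail OpenAI hasn't disclosed.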
We kind of have to assume it didn't, right? Otherwise bragging about the results makes zero sense and would be outright misleading.
Why wouldn't they? What are the incentives not to?
You don't have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact, you should be excited if we've started to break out of the limitation of forcing NNs to be load-bearing in literally everything. That's a sign of maturing technology, not of limitations.
Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.
This means we are far more trusting of software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron: it requires very precise inputs, parameters, and testing to act correctly (System 2 thinking).
Now with NNs it's inverted: a brilliant know-it-all, but it bullshits a lot and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking, with questionable but evolving System 2 skills whose limits we don't know.
If you're not familiar with System 1 / System 2, it's googlable.
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.
Which is greater, 9.11 or 9.9?
/s I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics, so this aligns.
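(The joke in code, for anyone who missed that round: the ambiguity is whether "9.11" is a decimal or a version number.)

    print(9.11 > 9.9)        # False: as decimals, 9.11 < 9.9
    print("9.11" > "9.9")    # False: lexicographically, '1' < '9'
    print((9, 11) > (9, 9))  # True: as version numbers, 9.11 comes after 9.9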
We are simply greasing the grooves and letting things slide faster and faster and calling it progress. How does this help to make the human and nature integration better?
Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?
No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.
It’s an annual human competition.
I am just happy the prize is so big for AI that there's enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "shut up and take my money" scenario.
I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.
Billion-dollar companies stealing not only the prize, prestige, time, and sleep of participants by brute-forcing their model through all the illegally scraped code on GitHub is a disgrace to humanity.
AI models should read the same materials humans do to become proficient in coding, without needing trillions of lines of code to ape through mindlessly. Otherwise the "AI" is no different from an elaborate Monte Carlo Tree Search (MCTS).
Yes, I know AI is quite advanced. I know that quite well: I study the latest SOTA papers daily and have developed my own models from the ground up as well. But despite all the advancements, it's still far from being substantially better than MCTS (see https://icml.cc/virtual/2025/poster/44177 and https://allenai.org/blog/autods).
EDIT, adding proof:
These are the results of the last competition they tried to win and LOST: https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...
(Looks like a pattern: OpenAI Corp is scraping competitions to place itself in the spotlight and headlines.)
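Since MCTS keeps being invoked as the benchmark, it's worth pinning down what it is: the select/expand/simulate/backpropagate loop below. A bare-bones sketch on a toy Nim game (take 1 to 3 stones, last stone wins), not anyone's production system:

    import math, random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent  # state = (stones, player_to_move)
            self.children, self.wins, self.visits = [], 0, 0

    def moves(state):
        stones, player = state
        return [(stones - k, 1 - player) for k in (1, 2, 3) if k <= stones]

    def rollout(state):
        # Random playout; the player who takes the last stone wins.
        while state[0] > 0:
            state = random.choice(moves(state))
        return 1 - state[1]  # the player who just moved

    def mcts(root_state, iters=5000, c=1.4):
        root = Node(root_state)
        for _ in range(iters):
            node = root
            # 1. Select: descend by UCB1 while the node is fully expanded.
            while node.children and len(node.children) == len(moves(node.state)):
                node = max(node.children, key=lambda ch: ch.wins / ch.visits
                           + c * math.sqrt(math.log(node.visits) / ch.visits))
            # 2. Expand: add one untried child, if the game isn't over.
            untried = [s for s in moves(node.state)
                       if s not in {ch.state for ch in node.children}]
            if untried:
                node.children.append(Node(untried[0], node))
                node = node.children[-1]
            # 3. Simulate: random playout from here.
            winner = rollout(node.state)
            # 4. Backpropagate: each node's wins are counted for the player
            #    who moved into it, i.e. the opponent of node.state's mover.
            while node:
                node.visits += 1
                node.wins += (winner != node.state[1])
                node = node.parent
        return max(root.children, key=lambda ch: ch.visits).state

    print(mcts((10, 0)))  # from 10 stones it should usually find (8, 1)

Whatever one thinks of the "elaborate MCTS" framing, at least this makes the comparison precise.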
- "[AI is] far away from being substantially being better than MCTs"
^ pick only one
Whether or not they're far from being better than humans is up for debate, but the entire point of these types of benchmarks is to compare them to humans.
Yeah same way computers and robots should be able to win World Chess Championship, 100m dash and Wimbledon.
>> but the entire point of these types of benchmarks is to compare them to humans
The entire point of the competition is to fight against participants who are similar to you, have similar capabilities and go through similar struggles. If you want bot vs human competitions - great - organize it yourself instead of hijacking well established competitions out there.
Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.
BTW, "gold medal performance" looks like a promotional term to me.
However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.
Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.
Now it is just doing a bunch of tweets?
And many other things
> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.
I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.
I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.
It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: claiming that their results are fraudulent and (incorrectly) using the Matharena results as your proof.
Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.
What? This is a claim with all the trustworthiness of OpenAI's claim. I mean, I could claim anything I want at this point and it would be just as trustworthy as OpenAI's claim, with exactly zero details about anything other than "we did it, promise".
>GPT5 soon
>it will not be as good as this secret(?) model
There's so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding: you need the collective global knowledge of society to build on. And yes, there's enough left for ten years of exploration.
More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.
The Pro AI crowd, VC, tech CEOs etc have strong incentive to claim humans are obsolete. Tech employees see threats to their jobs and want to poopoo any way AI could be useful or competitive.
The "It will just get better" is bubble baiting the investors. The tech companies learned from the past and they are riding and managing the bubble to extract maximum ROI before it pops.
The reality is that a lot of work done by humans can be replaced by an LLM at lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.
Current LLMs are enshittification accelerators, and that will have real effects.
I do not see that at all in this comment section.
There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.
You say, "explaining away the increasing performance" as though that was a good faith representation of arguments made against LLMs, or even this specific article. Questionong the self-congragulatory nature of these businesses is perfectly reasonable.
any details?
I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.
But we must be wary of mixing up smart information retrieval with reasoning.