Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy without being ready.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them and to you, my friends, I say: this is the reason why AI would not take over the world (and might even not be that useful for real world tasks), even if it wins every damn contest out there.
Here's an example problem 5:
Let a_1, a_2, …, a_n be distinct positive integers and let M = max_{1≤i<j≤n} (a_i + a_j)(a_j − a_i).
Find the maximum number of pairs (i, j) with 1 ≤ i < j ≤ n for which (a_i + a_j)(a_j − a_i) = M.
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
You degraded this thread badly by posting so many comments like this.
In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.
He thought there was an 8% chance of this happening.
Eliezer Yudkowsky said "at least 16%".
Source:
https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...
We may certainly hope Eliezer's other predictions don't prove so well-calibrated.
The greater the prior predictive power of human agents, the greater the a posteriori acceleration of progress in LLM math capability this result implies.
Here we are supposing that the increase in training data is not the main explanatory factor.
This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information.
(1) Bad prior prediction capability of humans implies that the result does not provide any information.
(2) Good prior prediction capability of humans implies that there is acceleration in the math capabilities of LLMs.
You definitely should assume they are. They are rationalists; the modus operandi is to pull stuff out of thin air and slap a single-digit-precision percentage prediction in front to make it seem grounded in science and well thought out.
The point of giving such estimates is mostly an exercise in getting better at understanding the world, and a way to keep yourself honest by making predictions in advance. If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability. (The quantitative way to compare these things is to see who has lower log loss [1].)
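A minimal sketch of that comparison, in plain Python with made-up forecasts and outcomes (nothing here is from the thread):

```python
import math

# Hypothetical resolved questions: 1 = happened, 0 = didn't.
outcomes = [1, 0, 1, 1, 0]

# Two forecasters' stated probabilities for those same events (made up).
alice = [0.8, 0.3, 0.6, 0.9, 0.2]
bob   = [0.6, 0.5, 0.5, 0.7, 0.4]

def log_loss(probs, outcomes):
    """Average negative log-likelihood of the outcomes; lower is better."""
    return -sum(math.log(p if y == 1 else 1 - p)
                for p, y in zip(probs, outcomes)) / len(outcomes)

print(f"Alice: {log_loss(alice, outcomes):.3f}")  # sharper forecasts, lower loss
print(f"Bob:   {log_loss(bob, outcomes):.3f}")
```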
This sounds like a circular argument. You started explaining why them giving percentage predictions should make them more trustworthy, but when looking into the details, I seem to come back to 'just trust them'.
People's bets are publicly viewable. The website is very popular with these "rationality-ists" you refer to.
I wasn't in fact arguing that giving a prediction should make people more trustworthy, please explain how you got that from my comment? I said that the main benefit to making such predictions is as practice for the predictor themselves. If there's a benefit for readers, it is just that they could come along and say "eh, I think the chance is higher than that". Then they also get practice and can compare how they did when the outcome is known.
Clowns, mostly. Yudkowsky in particular, whose only job today seems to be making awful predictions and letting lesswrong eat it up when one out of a hundred ends up coming true, solidifying his position as AI-will-destroy-the-world messiah. They make money from these outlandish takes, and more money when you keep talking about them.
It's kind of like listening to the local drunkard at the bar who once in a while ends up predicting which team is going to win in football in between drunken and nonsensical rants, except that for some reason posting the predictions on the internet makes him a celebrity, instead of just a drunk curiosity.
https://en.wikipedia.org/wiki/Brier_score
also:
That is not my experience talking with rationalists irl at all. And that is precisely my issue, it is pervasive in every day discussion about any topic, at least with the subset of rationalists I happen to cross paths with. If it was just for comparing ability to forecast or for bets, then sure it would make total sense.
Just the other day I had a conversation with someone about working in AI safety, and it went something like "well I think there is 10 to 15% chance of AGI going wrong, and if I join I have maybe 1% chance of being able to make an impact and if.. and if... and if, so if we compare with what I'm missing by not going to <biglab> instead I have 35% confidence it's the right decision"
What makes me uncomfortable with this is that by using this kind of reasoning and coming out with a precise figure at the end, it cognitively biases you into being more confident in your reasoning than you should be. Because we are all used to treating numbers as the output of a deterministic, precise, scientific process.
There is no reason to say 10% or 15% and not 8% or 20% for rogue AGI, there is no reason to think one individual can change the direction by 1% and not by 0.3% or 3%, it's all just random numbers, and so when you multiply a gut feeling number by a gut feeling number 5 times in a row, you end up with something absolutely meaningless, where the margin of error is basically 100%.
But it somehow feels more scientific and reliable because it's a precise number, and I think this is dishonest and misleading both to the speaker themselves and to listeners. "Low confidence", or "im really not sure but I think..." have the merit of not hiding a gut feeling process behind a scientific veil.
To be clear I'm not saying you should never use numerics to try to quantify gut feeling, it's ok to say I think there is maybe 10% chance of rogue AGI and thus I want to do this or that. What I really don't like is the stacking of multiple random predictions and trying to reason about this in good faith.
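To make the compounding-gut-feelings point concrete, here is a toy Monte Carlo sketch; every range below is invented purely for illustration:

```python
import random

random.seed(0)

def fuzzy(low, high):
    # a "gut feeling" drawn from the whole range you would actually accept
    return random.uniform(low, high)

samples = []
for _ in range(100_000):
    p_agi_goes_wrong = fuzzy(0.08, 0.20)        # "10 to 15%" could as easily be 8-20%
    p_i_make_a_difference = fuzzy(0.003, 0.03)  # "1%" could be 0.3-3%
    p_that_difference_matters = fuzzy(0.1, 0.5)
    samples.append(p_agi_goes_wrong * p_i_make_a_difference * p_that_difference_matters)

samples.sort()
print(f"median:       {samples[len(samples) // 2]:.5f}")
print(f"90% interval: {samples[5_000]:.5f} to {samples[95_000]:.5f}")
# The interval spans roughly an order of magnitude, which is the point above:
# the multiplied-out number carries far less precision than it appears to.
```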
> And the 16.27% mockery is completely dishonest.
Obviously satire
Take a look at this paper: https://scholar.harvard.edu/files/rzeckhauser/files/value_of...
They took high-precision forecasts from a forecasting tournament and rounded them to coarser buckets (nearest 5%, nearest 10%, nearest 33%), to see if the precision was actually conveying any real information. What they found is that if you rounded the forecasts of expert forecasters, Brier scores got consistently worse, suggesting that expert forecast precision at the 5% level is still conveying useful, if noisy, information. They also found that less expert forecasters took less of a hit from rounding their forecasts, which makes sense.
It's a really interesting paper, and they recommend that foreign policy analysts try to increase precision rather than retreating to lumpy buckets like "likely" or "unlikely".
Based on this, it seems totally reasonable for a rationalist to make guesses with single digit precision, and I don't think it's really worth criticizing.
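A rough way to see the mechanism on synthetic data (this is not the paper's data or method, just a toy simulation of a calibrated forecaster):

```python
import random

random.seed(1)

def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def round_to(p, step):
    return round(p / step) * step

# Simulate a well-calibrated forecaster: the forecast equals the true
# probability, and the outcome is sampled from that probability.
truth = [random.random() for _ in range(500_000)]
outcomes = [1 if random.random() < p else 0 for p in truth]

print(f"exact forecasts:        {brier(truth, outcomes):.4f}")
for step in (0.05, 0.10, 0.33):
    rounded = [round_to(p, step) for p in truth]
    print(f"rounded to nearest {step:.0%}: {brier(rounded, outcomes):.4f}")
# Coarser buckets throw away information, so the Brier score gets worse,
# mirroring what the paper found for its most skilled forecasters.
```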
And 16% very much feels ridiculous to a reader when they could've just said 15%.
Whether rationalists who are publicly commenting actually achieve that level of reliability is an open question. But that humans can be reliable enough in the real world that the last percentage matters, has been demonstrated.
-log_2(.15/(1-.15)) ≈ 2.50  ->  -log_2(.16/(1-.16)) ≈ 2.39
So saying 16% instead of 15% implies an additional tenth of a bit of evidence in favor (alternatively, 16/15 ~= 1.07 ~= 2^.1).
I don't know if I can weigh in on whether humans should drop a tenth of a bit of evidence to make their conclusion seem less confident. In software (e.g. a spam detector), dropping that much information to make the conclusion more presentable would probably be a mistake.
If the variance (uncertainty) in a number is large, the correct thing to do is to just also report the variance, not to round the mean to a whole number.
Also, in log odds, the difference between 5% and 10% is about the same as the difference between 40% and 60%. So using an intermediate value like 8% is less crazy than you'd think.
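You can check both claims in a couple of lines of Python:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

print(logit(0.10) - logit(0.05))                  # ~0.75
print(logit(0.60) - logit(0.40))                  # ~0.81, about the same gap
print((logit(0.16) - logit(0.15)) / math.log(2))  # ~0.11 bits: the 15% vs 16% gap
```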
People writing comments in their own little forum where they happen not to use sig-figs to communicate uncertainty is probably not a sinister attempt to convince "everyone" that their predictions are somehow scientific. For one thing, I doubt most people are dumb enough to be convinced by that, even if it were the goal. For another, the expected audience for these comments was not "everyone", it was specifically people who are likely to interpret those probabilities in a Bayesian way (i.e. as subjective probabilities).
I really wonder what you mean by this. If I put my finger in the air and estimate the emergence of AGI as 13%, how do I get at the variance of that estimate? At face value, it is a number, not a random variable, and does not have a variance. If you instead view it as a "random sample" from the population of possible estimates I might have made, it does not seem well defined at all.
You're absolutely right that if you have a binary random variable like "IMO gold by 2026", then the only thing you can report about its distribution is the probability of each outcome. This only makes it even more unreasonable to try and communicate some kind of "uncertainty" with sig-figs, as the person I was replying to suggested doing!
(To be fair, in many cases you could introduce a latent variable that takes on continuous values and is closely linked to the outcome of the binary variable. Eg: "Chance of solving a random IMO problem for the very best model in 2025". Then that distribution would have both a mean and a variance (and skew, etc), and it could map to a "distribution over probabilities".)
No.
I responded to the same point here: https://news.ycombinator.com/item?id=44618142
> correct thing to do is to just also report the variance
And do we also pull this one out of thin air?
Using precise numbers to convey extremely imprecise and ungrounded opinions is imho wrong and to me unsettling. I'm pulling this purely out of my ass, and maybe I am making too much out of it, but I feel this is in part what is causing the many cases of very weird, and borderline asocial/dangerous, behaviours of some associated with the rationalist movement. When you try to precisely quantify what cannot be, and start trusting those numbers too much, you can easily be led to trust your conclusions way too much. I am 56% confident this is a real effect.
In all seriousness, I do agree it's a bit harmful for people to use this kind of reasoning, but only practice it on things like AGI that will not be resolved for years and years (and maybe we'll all be dead when it does get resolved). Like ideally you'd be doing hand-wavy reasoning with precise probabilities about whether you should bring an umbrella on a trip, or applying for that job, etc. Then you get to practice with actual feedback and learn how not to make dumb mistakes while reasoning in that style.
> And do we also pull this one out of thin air?
That's what we do when training ML models sometimes. We'll have the model make a Gaussian distribution by supplying both a mean and a variance. (Pulled out of thin air, so to speak.) It has to give its best guess of the mean, and if the variance it reports is too small, it gets penalized accordingly. Having the model somehow supply an entire probability distribution is even more flexible (and even less communicable by mere rounding). Of course, as mentioned by commenter danlitt, this isn't relevant to binary outcomes anyways, since the whole distribution is described by a single number.
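A minimal sketch of that loss (plain Python, toy numbers); the point is that reporting too small a variance blows up the penalty:

```python
import math

def gaussian_nll(mu, var, target):
    """Negative log-likelihood of `target` under N(mu, var).
    If the reported var is too small for the actual error, the penalty explodes."""
    return 0.5 * (math.log(2 * math.pi * var) + (target - mu) ** 2 / var)

# Same prediction error (0.5), different reported uncertainty:
print(gaussian_nll(mu=0.0, var=0.01, target=0.5))  # overconfident -> ~11.1
print(gaussian_nll(mu=0.0, var=0.25, target=0.5))  # honest spread  -> ~0.73
```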
I am obviously only talking from my personal anecdotal experience, but having been on a bunch of coffee chats in the last few months with people in the AI safety field in SF, a lot of them Lesswrong-ers, I experienced a lot of those discussions with random % being thrown out in succession to estimate the final probability of some event, and even though I have worked in ML for 10+ years (so I would guess I'm more constantly aware of what a Bayesian probability is than the average person), I do find myself often swayed by whatever number comes out at the end and having to consciously take a step back and stop myself from instinctively trusting this random number more than I should. I would not need to pull myself back, I think, if we were using words instead of precise numbers.
It could just be a personal mental weakness with numbers that is not general, but looking at my interlocutors' emotional reactions to their own numerical predictions, I do feel quite strongly that this is a general human trait.
Your feeling is correct; anchoring is a thing, and good LessWrongers (I hope to be in that category) know this and keep track of where their prior and not just posterior probabilities come from: https://en.wikipedia.org/wiki/Anchoring_effect
Probably don't in practice, but should. That "should" is what puts the "less" into "less wrong".
And since we’re at it: why not give confidence intervals too?
To add to what tedsanders wrote: there's also research that shows verbal descriptions, like those, mean wildly different things from one person to the next: https://lettersremain.com/perceptions-of-probability-and-num...
Second, happy to test it on open math conjectures or by attempting to reprove recent math results.
For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.
Suggesting they should be given the benefit of the doubt is dishonest at this point.
This is why HN threads about AI have become exhausting to read
Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or were they created for something different, and completely by chance it was discovered they could be used for the challenge, unintentionally?
The bigger question is - why should everyone be excited by this? If they don't plan to share anything related to this AI model back to humanity.
The thing with IMO, is the solutions are already known by someone.
So suppose the model got the solutions beforehand, and fed them into the training model. Would that be an acceptable level of "cheating" in your view?
Finally, even if you aligned the model with the answers, the weight shift in such an enormous model would be inconsequential. You would need to prime the context or boost the weights. All this seems like absurd lengths to go to just to cheat on this one thing, rather than focusing your energies on actually improving model performance. The payout for OpenAI isn’t a gold medal in the IMO; it’s having a model that can get a gold medal at the IMO and then selling it. But it has to actually be capable of doing what’s on the tin, otherwise their customers will easily and rapidly discover this.
Sorry, I like tin foil as much as anyone else, but this doesn’t seem credibly likely given the incentive structure.
> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
https://x.com/alexwei_/status/1946477745627934979?s=46&t=Hov...
How do you know?
> This is not a model specialized to IMO problems.
Any proof?
There are trillions of dollars at stake in hyping up these products; I take everything these companies write with a cartload of salt.
It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams nobody solved it.
Edit: Fixed P4 -> P3. Thanks.
There is no reason why machines would do badly on exactly the problem which humans do badly as well - without humans prodding the machine towards a solution.
Also, there is no reason why machines could not produce a partial or wrong answer to problem 6, which seems like survivorship bias to me, i.e., that only correct solutions were cherry-picked.
Unless the machine is trained to mimic human thought process.
There are many things that are hard for AIs for the same reason they’re hard for humans. There are subtleties in complexity that make challenging things universal.
Obviously the model was trained on human data so its competencies lie in what other humans have provided input for over the years in mathematics, but that isn’t data contamination, that’s how all humans learn. This model, like the contestants, never saw the questions before.
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
According to the twitter thread, the model was not given access to tools.
That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.
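Mechanically, "the LLM did the cherry-picking" could look something like the sketch below; `sample_solution` and `verify` are hypothetical stand-ins, not anything OpenAI has described:

```python
import random

def sample_solution(problem, seed):
    """Stand-in for one independent rollout of the model (hypothetical)."""
    random.seed(seed)
    return {"proof": f"attempt-{seed}", "self_score": random.random()}

def verify(candidate):
    """Stand-in for an automated check: a formal verifier, a grader model,
    or majority voting. If a human performs this step instead, the result
    says much less about the model itself."""
    return candidate["self_score"] > 0.95

def best_of_n(problem, n):
    candidates = [sample_solution(problem, seed) for seed in range(n)]
    accepted = [c for c in candidates if verify(c)]
    return max(accepted, key=lambda c: c["self_score"]) if accepted else None

print(best_of_n("some hard problem", n=10_000))
```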
We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.
Why wouldn't they? What are the incentives not to?
[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...
You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact, you should be excited if we’ve started to break out of the limitations of forcing NNs to be load bearing in literally everything. That’s a sign of maturing technology, not of limitations.
Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.
This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.
Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.
If you're not familiar with System 1 / System 2, it's googlable.
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
What in the accelerationist hell?
Not trying to be a smarty pants here, but what do we mean by "reason"?
Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.
It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.
It gives me feedback like "lock free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago, it will see that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.
What is not reasoning about this? Last year at this time, if I looked at my code with a two hour delta, and someone had pushed edits that were able to compile, with real improvements, I would not have any doubt that there was a reasoning, intelligent person who had spent years learning how this worked.
Is it pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?
I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.
The only thing that should matter is the results they get. And I have a hard time understanding why the thing that is supposed to behave in an intelligent way but often just spews nonsense gets 10x budget increases over and over again.
This is bad software. It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output is garbage software. Sinking 10x, 100x, 1000x more resources into it is irrational.
Nothing else matters. Maybe it reasons, maybe it's intelligent. If it produces garbled nonsense often, giving the teams behind it 10x the compute is insane.
Is that very unlike humans?
You seem to be comparing LLMs to much less sophisticated deterministic programs. And claiming LLMs are garbage because they are stochastic.
Which entirely misses the point because I don't want an LLM to render a spreadsheet for me in a fully reproducible fashion.
No, I expect an LLM to understand my intent, reason about it, wield those smaller deterministic tools on my behalf and sometimes even be creative when coming up with a solution, and if that doesn't work, dream up some other method and try again.
If _that_ is the goal, then some amount of randomness in the output is not a bug it's a necessary feature!
We deal with non-determinism any time our code interacts with the natural world. We build guard rails, detection, classification of false/true positives and negatives, and all that, all the time. This isn’t a flaw, it’s just the way things are for certain classes of problems and solutions.
It’s not bad software - it’s software that does things we’ve been trying to do for nearly a hundred years, beyond any reasonable expectation. The fact I can tell a machine in human language to do some relatively abstract and complex task and it pretty reliably “understands” me and my intent, “understands” its tools and capabilities, and “reasons” how to bridge my words to a real world action is not bad software. It’s science fiction.
The fact “reliably” shows up is the non determinism. Not perfectly, although on a retry with a new seed it often succeeds. This feels like most software that interacts with natural processes in any way or form.
It’s remarkable that anyone who has ever implemented exponential backoff and retry, or has ever handled edge cases, can sit and say “nothing else matters” when they make their living dealing with non-determinism. Because the algorithmic kernel of logic is 1% of programming and systems engineering, and 99% is coping with the non-determinism in computing systems.
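A minimal sketch of the kind of guard rail meant here, assuming a hypothetical flaky dependency:

```python
import random
import time

def flaky_call():
    """Stand-in for any non-deterministic dependency: a network API, an LLM, etc."""
    if random.random() < 0.3:
        raise RuntimeError("transient failure")
    return "ok"

def with_backoff(fn, max_attempts=5, base_delay=0.05):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            # exponential backoff with jitter before the next attempt
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

print(with_backoff(flaky_call))
```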
The technology is immature and the toolchains are almost farcically basic - money is pouring into model training because we have not yet hit a wall with brute force. And it takes longer to build a new way of programming and designing highly reliable systems in the face of non-determinism, but it’s getting better faster than almost any technology change in my 35 years in the industry.
Your statement that it “very often produces wrong or nonsensical output” also tells me you’re holding onto a bias from prior experiences. The rate of improvement is astonishing. At this point in my professional use of frontier LLMs and techniques they are exceeding the precision and recall of humans and there’s a lot of rich ground untouched. At this point we largely can offload massive amounts of work that humans would do in decision making (classification) and use humans as a last line to exercise executive judgement often with the assistance of LLMs. I expect within two years humans will only be needed in the most exceptional of situations, and we will do a better job on more tasks than we ever could have dreamed of with humans. For the company I’m at this is a huge bottom line improvement far and beyond the cost of our AI infrastructure and development, and we do quite a lot of that too.
If you’re not seeing it yet, I wouldn’t use that to extrapolate to the world at large and especially not to the future.
I’m using Opus 4 for coding and there is no way that model demonstrates any reasoning or “intelligence,” in my opinion. I’ve been through the having-conversations phase etc. but it doesn’t get you very far; better to read a book.
I use these models to help me type less now, that’s it. My prompts basically tell it to not do anything fancy and that works well.
Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.
I haven't read the IMO problems, but knowing how math Olympiad problems work, they're probably not really "unencountered".
People aren't inventing these problems ex nihilo, there's a rulebook somewhere out there to make life easier for contest organizers.
People aren't doing these contests for money, they are doing them for honor, so there is little incentive to cheat. With big business LLM vendors it's a different situation entirely.
1) Did humans formalize the input? 2) Did humans prompt the LLM towards the solution? etc.
I am excited to hear about it, but I remain skeptical.
Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.
This is almost certainly the case; remember the initial o3 ARC benchmark? I could add that this is probably a multi-agent system as well, so the context length restriction can be bypassed.
Overall, AI good at math problems doesn't make news to me. It is already better than 99.99% of humans, now it is better than 99.999% of us. So ... ?
Which is greater, 9.11 or 9.9?
/s I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics, so this aligns.
We are simply greasing the grooves and letting things slide faster and faster and calling it progress. How does this help to make the human and nature integration better?
Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?
No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.
It’s an annual human competition.
It’s not an AI benchmark generated for AI. It was targeted at humans
The latest models can score something like 70% on SWE-bench verified and yet it’s difficult to say what tangible impact this has on actual software development. Likewise, they absolutely crush humans at sport programming but are unreliable software engineers on their own.
What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all besides the simple fact that this LLM is very, very good at IMO style problems?
- OpenAI denied training on FrontierMath, FrontierMath-derived data, or data targeting FrontierMath specifically
- The training data for o3 was frozen before OpenAI even downloaded FrontierMath
- The final o3 model was selected before OpenAI looked at o3's FrontierMath results
Primary source: https://x.com/__nmca__/status/1882563755806281986
You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hint of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.
Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.
- A 3Blue1Brown video on a particularly nice and unexpectedly difficult IMO problem (2011 IMO, Q2): https://www.youtube.com/watch?v=M64HUIJFTZM
-- And another similar one (though technically Putnam, not IMO): https://www.youtube.com/watch?v=OkmNXy7er84
- Timothy Gowers (Fields Medalist and IMO perfect scorer) solving this year’s IMO problems in “real time”:
I am just happy the prize is so big for AI that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, network, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "shut up and take my money" scenario.
I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.
And AI changes so quickly that there is a breakthrough every week.
Call me cynical, but I think this is an RLHF/RLVR push in a narrow area--IMO was chosen as a target and they hired specifically to beat this "artificial" target.
BTW, “gold medal performance” looks like a promotional term to me.
However, I expect that geometric intuition may still be lacking mostly because of the difficulty of encoding it in a form which an LLM can easily work with. After all, Chatgpt still can't draw a unicorn [1] although it seems to be getting closer.
Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.
Instead of the more traditional Leetcode-like problems, it's things like optimizing scheduling/clustering according to some loss function. Think simulated annealing or pruned searches.
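For flavor, a toy version of that kind of problem (not an actual contest task): minimize a scheduling makespan with simulated annealing:

```python
import math
import random

random.seed(0)

# Toy task: assign 20 jobs with given durations to 4 machines,
# minimizing the busiest machine's total load (the makespan).
jobs = [random.randint(1, 30) for _ in range(20)]
assign = [random.randrange(4) for _ in jobs]

def makespan(assignment):
    loads = [0] * 4
    for duration, machine in zip(jobs, assignment):
        loads[machine] += duration
    return max(loads)

temp, cost = 50.0, makespan(assign)
for _ in range(20_000):
    i = random.randrange(len(jobs))
    old = assign[i]
    assign[i] = random.randrange(4)        # propose moving one job
    new_cost = makespan(assign)
    # always accept improvements; accept worsenings with probability exp(-delta/T)
    if new_cost <= cost or random.random() < math.exp((cost - new_cost) / temp):
        cost = new_cost
    else:
        assign[i] = old                     # revert the move
    temp *= 0.9997                          # cool down

print("final makespan:", cost)
```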
Now it is just doing a bunch of tweets?
And many other things
> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.
I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.
Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.
The one OpenAI "scandal" that I did agree with was the thing where they threatened to cancel people's vested equity if they didn't sign a non-disparagement agreement. They did apologize for that one and make changes. But it doesn't have a lot to do with their research claims.
I'm open to actual evidence that OpenAI's research claims are untrustworthy, but again, I also judge people individually, not just by the organization they belong to.
This is certainly a courageous viewpoint – I imagine this makes it very hard for you to engage in the modern world? Most of us are very bound by institutions we operate in!
Me: I have a way to turn lead into gold.
You: Show me!!!
Me: NO (and then spends the rest of my life in poverty).
Cold fusion (the physics, not the programming language) is the best example of why you show your work. This is the Valley we're talking about. It's the thunderdome of technology and companies. If you have a meaningful breakthrough you don't talk about it, you drop it on the public and flex.
When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence or disbelieve the existence of the pie. And I start to believe that it'll probably be a particularly good pie.
This is from OpenAI. They've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims. On the other hand, it seems like a dumb thing to say unless they're really going to deliver that soon.
This is called marketing.
> When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence
Because you have evidence, it smells.
And if later you ask your wife "where is the pie" and she says "I sprayed pie scent in the air, I was just signaling," how are you going to feel?
OpenAI spent its "fool us once" card already. Doing things this way does not earn back trust, and neither does failure to deliver (and they have done that more than once). See staff non-disparagement, see the math fiasco, see open weights.
Many signals are marketing, but the purpose of signals is not purely to develop markets. We all have to determine what we think will happen next and how others will act.
> Because you have evidence, it smells.
I think you read that differently than what I intended to write -- she claims it smells good.
> Open AI spent its "fool us once" card already.
> > This is from OpenAI. Here they've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims.
>> Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to.
That has nothing to do with anything I said. A claim can be false without it being fraudulent, in fact most false claims are probably not fraudulent; though, still, false.
Claims are also very often contested. See e.g. the various claims of Quantum Superiority and the debate they have generated.
Science is a debate. If we believe everything anyone says automatically, then there is no debate.
"So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of his, she never wins (outcomes: may be infinite or she may lose by sum if she picks badly; but no win). So she does NOT have winning strategy at λ=c. So at equality, neither player has winning strategy."[1]
Why use lot word when few word do trick?
1. https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...
A) that's not clear
B) now we have "reasoning" models that can be used to analyse the data, create n rollouts for each data piece, and "argue" for / against / neutral on every piece of data going into the model. Imagine having every page of a "short story book" + 10 best "how to write" books, and do n x n on them. Huge compute, but basically infinite data as well.
We went from "a bunch of data" to "even more data" to "basically everything we got" to "ok, maybe use a previous model to sort through everything we got and only keep quality data" to "ok, maybe we can augment some data with synthetic datasets from tools etc" to "RL goes brrr" to (point B from above) "let's mix the data with quality sources on best practices".
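A sketch of what point B could look like mechanically; every name here (`StubModel`, `critique_rollouts`) is hypothetical, since nobody outside the labs knows the real pipeline:

```python
import random

class StubModel:
    """Placeholder for a reasoning-model API; purely hypothetical."""
    def generate(self, prompt):
        return f"critique based on: {prompt[:40]}..."
    def score(self, text):
        return random.random()

def critique_rollouts(passage, model, style_guides, n=3):
    """Sketch of point B: n critique rollouts per passage, each grounded in a
    different 'how to write' reference, keeping only the highly scored ones."""
    rollouts = []
    for guide in style_guides[:n]:
        prompt = (f"Using the principles in {guide!r}, argue for, against, and "
                  f"neutrally about the quality of this passage:\n{passage}")
        rollouts.append(model.generate(prompt))
    return [r for r in rollouts if model.score(r) > 0.5]

print(critique_rollouts("Call me Ishmael...", StubModel(),
                        ["The Elements of Style", "On Writing Well", "Story"]))
```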
B) Has learning through "self-play" (like with AlphaZero etc) been demonstrated to work for improving LLMs? What is the latest key research on this?
It might be a constraint on the evolution of godlike intelligence, or AGI. But at that point we're so far out in bong-hit territory that it will be impossible to say who's right or wrong about what's coming.
Has learning through "self-play" (like with AlphaZero etc) been demonstrated to work for improving LLMs?
My understanding (which might be incorrect) is that this amounts to RLHF without the HF part, and is basically how DeepSeek-R1 was trained. I recall reading about OpenAI being butthurt^H^H^H^H^H^H^H^H concerned that their API might have been abused by the Chinese to train their own model.
R1 managed to replicate a model on the level of one they had access to. But as far as I know they did not improve on its predictive performance? They did improve on inference time, but that is another thing. The ability to replicate a model is well demonstrated and has been quite common practice for some years already; see teacher-student distillation.
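For reference, the basic teacher-student distillation loss mentioned above, sketched with plain softmax/KL math:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions: the student
    is trained to match the teacher's whole output distribution, not just its
    top-1 answer."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, -2.0]
print(distill_loss(teacher, [3.5, 1.2, -1.8]))  # student close to teacher -> small loss
print(distill_loss(teacher, [0.0, 0.0, 0.0]))   # uninformed student -> larger loss
```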
The progress has come from all kinds of things. Better distillation of huge models to small ones. Tool use. Synthetic data (which is not leading to model collapse like theorized). Reinforcement learning.
I don't know exactly where the progress over the next year will be coming from, but it seems hard to believe that we'll just suddenly hit a wall on all of these methods at the same time and discover no new techniques. If progress had slowed down over the last year the wall being near would be a reasonable hypothesis, but it hasn't.
The new "Full Self-Driving next year"?
FWIW, when you get this reductive with your criterion there were technically self-driving cars in 2008 too.
And both of these reduce traffic
We’re already at the point where these tools are removing repetitive/predictable tasks from researchers (and everyone else), so clearly they’re already accelerating research.
I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.
It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.
The previous time they had claims about solving all of the math right there and right then, they were caught owning the company that makes that independent test, and could neither admit nor deny training on closed test set.
- OpenAI doesn't own Epoch AI (though they did commission Epoch to make the eval)
- OpenAI denied training on the test set (and further denied training on FrontierMath-derived data, training on data targeting FrontierMath specifically, or using the eval to pick a model checkpoint; in fact, they only downloaded the FrontierMath data after their o3 training set was frozen and they didn't look at o3's FrontierMath results until after the final o3 model was already selected. primary source: https://x.com/__nmca__/status/1882563755806281986)
You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hints of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.
Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.
Everywhere I worked offered me a significant amount of money to sign a non-disparagement agreement after I left. I have never met someone who didn't willingly sign these agreements. The companies always make it clear if you refuse to sign they will give you a bad recommendation in the future.
You just brought up several corporate statements which are not grounded in any evidence and could be untrue, so you haven't said that much so far.
I don't know what exactly is at play here, and how exactly OpenAI's models can produce those "exceptionally good" results in benchmarks and at the same time be utterly unable to do even a quarter of that in private evaluation of pretty much everyone I knew. I'd expect them to use some kind of RAG techniques that make the question "what was in the training set at model checkpoint" irrelevant.
If you consider that several billion dollars of investment and national security are at stake, "weird conspiracy" becomes a regular Tuesday.
Unfortunately I can't see beyond the first message of that primary source.
What? This is a claim with all the trust-worthiness of OpenAI's claim. I mean I can claim anything I want at this point and it would still be just as trust-worthy as OpenAI's claim, with exactly zero details about anything else than "we did it, promise".
>GPT5 soon
>it will not be as good as this secret(?) model
There’s so much to do at inference time. This result could not have been achieved without the substrate of general models. It’s not like Go or protein folding. You need the collective public global knowledge of society to build on. And yes, there’s enough left for ten years of exploration.
More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.
If you looked at RLHF hiring over the last year, there was a huge wave of hiring IMO competitors for RLHF. This was new, highly targeted, highly funded RLHF’ing.
https://benture.io/job/international-math-olympiad-participa...
https://job-boards.greenhouse.io/xai/jobs/4538773007
And Outlier/Scale, which was bought by Meta (via Scale), had many IMO-required Math AI trainer jobs on LinkedIn. I can't find those historical ones though.
I'm just one piece in the cog and this is an anecdote, but there was a huge upswing in IMO or similar RLHF job postings over the past 6mo-year.
I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable, all the actual educated takes are on X; it's almost like there was a talent drain.
The Pro AI crowd, VC, tech CEOs etc have strong incentive to claim humans are obsolete. Many tech employees see threats to their jobs and want to poopoo any way AI could be useful or competitive.
The "it will just get better" line is bait for bubble investors. The tech companies learned from the past and they are riding and managing the bubble to extract maximum ROI before it pops.
The reality is a lot of work done by humans can be replaced by an LLM with lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.
The current model of LLMs are enshitification accelerators and that will have real effects.
I do not see that at all in this comment section.
There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.
I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.
I worked on accelerating DNNs a little less than a decade ago and had you shown me what we’re seeing now with LLMs I’d say it was closer to 50 years out than 20 years out.
You mean the one that paves the way for ancient Egyptian slave worker economies?
Or totalitarian rule that 1984 couldn't imagine?
Or...... Worse?
The intermediate classes of society always relied on intelligence and competence to extract money from the powerful.
AI means those classes no longer have power.
I think what it feels like I see a lot, are people who - because of their fear of a future with super intelligent AI - try to like... Deny the significance of the event, if only because they don't _want_ to wrestle with the implications.
I think it's very important we don't do that. Let's take this future seriously, so we can align ourselves on a better path forward... I fear a future where we have years of bickering in the public forums on the veracity or significance of claims, if only because a subset of the public is incapable of mentally wrestling with the wild fucking shit we are walking into.
If not this, what is your personal line in the sand? I'm not specifically talking to any person when I say this. I just can't help but to feel like I'm going crazy, seeing people deny what is right in front of their eyes.
You say "explaining away the increasing performance" as though that were a good faith representation of arguments made against LLMs, or even this specific article. Questioning the self-congratulatory nature of these businesses is perfectly reasonable.
Something really funky is going on with newer AI models and benchmarks, versus how they perform subjectively when I use them for my use-cases. I say this across the board[1], not just regarding OpenAI. I don't know if frontier labs have run into Goodhart's law viz benchmarks, or if my use-cases are atypical.
1. I first noticed this with Claude 3.5 vs Claude 3.7
AI is of course a direct attack on the average HNers identity. The response you see is like attacking a Christian on his religion.
The pattern of defense is typical. When someone’s identity gets attacked they need to defend their identity. But their defense also needs to seem rational to themselves. So they begin scaffolding a construct of arguments that in the end support their identity. They take the worst aspects of AI and form a thesis around it. And that becomes the basis of sort of building a moat around their old identity as an elite programmer genius.
A telltale sign you or someone else is doing this is when you are talking about AI and someone just comments about how they aren’t afraid of AI taking over their own job when it wasn’t even directly the topic.
If you say AI is going to lessen the demand for software engineering jobs, the typical thing you hear is “I’m not afraid of losing my job,” and I’m like, bro, I’m not talking about your job specifically, I’m not talking about you or your fear of losing a job, I’m just talking about the economics of the job market. This is how you know it’s an identity thing more than a technical topic.
Also, please don't fulminate. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.
General attacks are fine, but we draw the line at personal.
So, as much as I get the frustration, comments like these don't really add much. It's complaining about others complaining. Instead this should be taken as a signal that maybe HN is not the right forum to read about these topics.
It's healthy to be skeptical, and it's even healthier to be skeptical of OpenAI, but there are commenters who clearly have no idea what IMO problems are saying that this somehow means nothing?
Almost every technical comment on HN is wrong (see for example essentially all the discussion of Rust async, in which people keep making up silly claims that Rust maintainers then attempt to patiently explain are wrong).
The idea that the "educated" takes are on X though... that's crazy talk.
There are a bunch of great accounts to follow that are only really posting content to x.
Karpathy, nearcyan, kalomaze, all of the OpenAI researchers including the link this discussion is on, many anthropic researchers. It's such a meme that you see people discuss reading Twitter thread + paper because the thread gives useful additional context.
Hn still has great comment sections on maker style posts, on network stuff, but I no longer enjoy the discussions wrt AI here. It's too hyperbolic.
Most of HN was very wrong about LLMs.
But in most other sites the statistic is 99%, so HN is still doing much better than average.
And like said, the researchers themselves are on X, even Gary Marcus is there. ;)
At this point, there are much better places to find technical discussion of AI, pros and cons. Even Reddit.
People here were pretty skeptical about AlexNet, when it won the ImageNet challenge 13 years ago.
What's the alternative here?
Some of us are implementing things in relation to AI, so we know it's not about "increasing performance of models" but actually about the right solution for the right problem.
If you think Twitter has "educated takes" then maybe go there and stop being pretentious schmuck over here.
Talent drain, lol. I'd much rather have skeptics and good tips than usernames, follows and social media engagement.
X is higher signal, but very group thinky. It's great if you want to know the trends, but gotta be careful not to jump off the cliff with the lemmings.
Highest signal is obviously non digital. Going to meetups, coffee/beers with friends, working with your hands, etc.
As a partially separate issue, there are people trying to punish comments quoting AI by downvotes. You don't need to have a non-informative reply, just sourcing it to AI is enough. A random internet dude telling the same thing with less justification or detail is fine to them.
https://news.ycombinator.com/newsguidelines.html
https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...
Finding a critical perspective and trying to understand why it can be wrong is more fun. You just say "I was wrong" when proved wrong.
How is it rational to 10x the budget over and over again when it mangles data every time?
The mind blowing thing is not being skeptical of that approach, it's defending it. It has become an article of faith.
It would be great to have AI chatbots. But chatbots that mangle data getting their budgets increased by orders of magnitude over and over again is just doubling down on the same mistake over and over again.
And then, it's likewise easy to be a reactionary to the extremes of the other side.
The middle is a harder, more interesting place to be, and people who end up there aren't usually chasing money or power, but some approximation of the truth.
As hackers we have more responsibility than the general public because we understand the tech and its side effects, we are the first line of defense so it is important to speak out not only to be on the right side of history but also to protect society.
Also, the performance of LLMs on IMO 2025 was not even bronze [3].
Finally, this article shows that LLMs were just mostly bluffing [4] on USAMO 2025.
[1] https://www.reddit.com/r/slatestarcodex/comments/1i53ih7/fro...
My skepticism stems from the past frontier math announcement which turned out to be a bluff.
The problem with the hype machine is that it provokes an opposite reaction and the noise from it buries any reasonable / technical discussion.
Usually my go-to example for LLMs doing more than mass memorization is Charton's and Lample's LLM trained on function expressions and their derivatives and which is able to go from the derivatives to the original functions and thus perform integration, but at the same time I know that LLMs are essentially completely crazy with no understanding of reality-- just ask them to write some fiction and you'll have the model outputting discussions where characters who have never met before are addressing each other by name, or getting other similarly basic things wrong, and when something genuinely is not in the model you will end up in hallucination land. So the people saying that the models are bad are not completely crazy.
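The Lample and Charton trick, roughly: generate random expressions, differentiate them symbolically, and train on the reversed (derivative -> original function) pairs. A toy sketch, assuming sympy is installed:

```python
import random
import sympy as sp

random.seed(0)
x = sp.Symbol("x")
atoms = [x, x**2, sp.sin(x), sp.exp(x), sp.log(x + 1)]

def random_expr(depth=2):
    """Build a small random expression by combining primitives."""
    if depth == 0:
        return random.choice(atoms)
    a, b = random_expr(depth - 1), random_expr(depth - 1)
    return random.choice([a + b, a * b])

for _ in range(5):
    f = random_expr()
    derivative = sp.diff(f, x)
    # training pair: given the derivative, predict the original function
    print(f"integrate[ {derivative} ]  ->  {f}")
```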
With the wrong codebase I wouldn't be surprised if you need a finetune.
I think it's the opposite: the general public blindly trusts all kinds of hyped stuff; it's a very few hyper-skeptics who are some fraction of a percent of the population.
Why waste time say lot word when few word do trick :)
Also worth pointing out that Alex Wei is himself a gold medalist at IOI.
Terence Tao also called it in a recent podcast, predicting that the top LLMs would get gold this year.
These models are trained on all the old problems and their various solutions. For LLMs, solving these problems is as impressive as writing code.
There is no high generalization.
Nice result but it's just another game humans got beaten at. This time a game which isn't even taken very seriously (in comparison to ones that have professional scene).
I don't think it's a very creative endeavor in comparison to chess/go. The searching required is less as well. There is a challenge in processing natural language and producing solutions in it, though.
Creativity required is not even a small fraction of what is required for scientific breakthroughs. After all no task that you can solve in 30 minutes or so can possibly require that much creativity - just knowledge and a fast mind - things computers are amazing at.
I am AI enthusiast. I just think a lot of things that were done so far are more impressive than being good at competitive math. It's a nice result blown out of proportion by OpenAI employees.
Edit: why was my comment moved from the one I was replying to? It makes no sense here on its own.
> I talked to IMO Secretary General Ria van Huffel at the IMO 2025 closing party about the OpenAI announcement. While I can't speak for the Board or the IMO (and didn't get a chance to talk about this with IMO President Gregor Dolinar, and I doubt the Board are readily in a position to meet for the next few days while traveling home), Ria was happy for me to say that it was the general sense of the Jury and Coordinators at IMO 2025 that it's rude and inappropriate for AI developers to make announcements about their IMO performances too close to the IMO (such as before the closing party, in this case; the general coordinator view is that such announcements should wait at least a week after the closing ceremony), when the focus should be on the achievements of the actual human IMO contestants and reports from AIs serve to distract from that.
> I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts.
It would raise more concerns that corps leaked questions/answers into training data and fine-tuned specialized models during this time.
https://x.com/natolambert/status/1946569475396120653
OAI announced early, probably we will hear announcement from Google soon.
The key difference is that they claim to have not used any verifiers.
Big if true. Setting up an RL loop for training on math problems seems significantly easier than many other reasoning domains. Much easier to verify correctness of a proof than to verify correctness (what would this even mean?) for a short story.
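A sketch of why math is the easy case for RL rewards; this is the simplest numeric-answer check (full proofs would need something like Lean or a grader model), and all names here are illustrative:

```python
import re

def math_reward(model_output, expected_answer):
    """Verifiable reward: 1 if the final number stated matches, else 0.
    (Grading full proofs needs a prover such as Lean or a grader model;
    this is only the simple numeric-answer case.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == str(expected_answer) else 0.0

def story_reward(model_output):
    """No mechanical check exists; 'correctness' of a short story isn't even
    well defined, so you fall back on learned reward models or human raters."""
    raise NotImplementedError

print(math_reward("... therefore the answer is 42", 42))  # 1.0
print(math_reward("... so we get 41", 42))                # 0.0
```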
My proving skills are extremely rusty so I can’t look at these and validate them. They certainly are not traditional proofs though.
It reads like someone who found the correct answer but seemingly had no understanding of what they did and just handed in the draft paper.
Which seems odd, shouldn't an LLM be better at prose?
I don't know which one I would consider the most prestigious math competition, but it wouldn't be the IMO. The Putnam ranks higher to me and I'm not even an American. But I've come to realise one thing, and that is that high school is very important to Americans...
My very rough napkin math suggests that against the US reference class, IMO gold is literally a one-in-a-million talent (very roughly 20 people who make camp could get gold, out of very roughly twenty million relevant high schoolers).
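Spelling that out (both figures are the commenter's rough guesses, not official statistics):

```python
# Napkin math: roughly 20 camp attendees capable of gold, out of roughly
# twenty million US high schoolers in the reference class.
gold_capable = 20
reference_class = 20_000_000
print(gold_capable / reference_class)  # 1e-06, i.e. about one in a million
```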
I think that the number of students who are even aware of the competition is way lower than the total number of students.
I mean, I don’t think I’d have been a great competitor even if I tried. But I’m pretty sure there are a lot of students that could do well if given the opportunity.
If your school had a math team and you were on it, I would be surprised if you hadn't heard of it.
You may not have heard of the IMO because no one in your school district, possibly even your state, got in. It is extremely selective (something like 20 students in the entire country).
GPT-5 finally on the horizon!
Although it's the benchmark that is publicly available. The model is not.
"The plight of India cannot be borne to see..." (भारत दुर्दशा न देखी जाई...)
The real reason might be that there's an enormous class of self-loathing elites in India who actively despise the possibility of any Indian language being represented in higher education. This obviously stunts the possibility of them being used in international competitions.
Discussions online have a tendency to go off into tangents like this. It's regrettable that this is such a contentious topic.
We detached this subthread from https://news.ycombinator.com/item?id=44615783.
Conclusion: It is overwhelmingly likely that this document was generated by a human.
----
Self-Correction/Refinement and Explicit Goals:
"Exactly forbidden directions. Good." - This self-affirmation is very human.
"Need contradiction for n>=4." - Clearly stating the goal of a sub-proof.
"So far." - A common human colloquialism in working through a problem.
"Exactly lemma. Good." - Another self-affirmation.
"So main task now: compute K_3. And also show 0,1,3 achievable all n. Then done." - This is a meta-level summary of the remaining work, typical of human problem-solving "
----
>Commenters on HN claim it must not be that hard, or OpenAI is lying, or cheated. Anything but admit that it is impressive
Every time on this site lol. A lot of people here have an emotional aversion to accepting AI progress. They’re deep in the bargaining/anger/denial phase.
> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”
Finance, chemistry, biology, medicine.
Given that AI companies are constantly trying to slurp up any and all data online, if the model's solutions were derived from existing work, the result is perhaps less impressive than it appears at first glance. If a present-day model does well at IMO 2026, that would be nice.
All my life I’ve taken for granted that your value is related to your positive impact, and that the unique value of humans is that we can create and express things like no other species we’ve encountered.
Now, we have created this thing that has ripped away many of my preconceptions.
If an AI can adequately do whatever a particular person does, then is there still a purpose for that person? What can they contribute? (No I am not proposing or even considering doing anything about it).
It just makes me sad, like something special is going away from the world.
It seems a common recent neurosis (albeit protective one) to proclaim a permanent human preeminence over the world of value, moral status and such for reasons extremely coupled with our intelligence, and then claim that certain kinds of intelligence have nothing to do with it when our primacy in those specific realms of intelligence is threatened. This will continue until there's nothing humans have left to bargain with.
The world isn't what we want it to be; the world is what it is. The closest thing we have to the world turning out the way we want is making it that way. Which is why I think many of those who hate AI would give their vision of how the world ought to be a better fighting chance by putting in the work to make it so, rather than sitting in denial of what is happening in the world of artificial intelligence.
I agree that denial is not an approach that’s likely to be productive.
Edit due to rate-limiting:
o3-pro returned an answer after 24 minutes: https://chatgpt.com/share/687bf8bf-c1b0-800b-b316-ca7dd9b009... Whether the CoT amounts to valid mathematical reasoning, I couldn't say, especially because OpenAI models tend to be very cagey with their CoT.
Gemini 2.5 Pro seems to have used more sophisticated reasoning ( https://g.co/gemini/share/c325915b5583 ) but it got a slightly different answer. Its chain of thought was unimpressive to say the least, so I'm not sure how it got its act together for the final explanation.
Claude Opus 4 appears to have missed the main solution the others found: https://claude.ai/share/3ba55811-8347-4637-a5f0-fd8790aa820b
It would be interesting if someone could try Grok 4.
When (not if) AI does make a major scientific discovery, we'll hear "well it's not really thinking, it just processed all human knowledge and found patterns we missed - that's basically cheating!"
I.e., they...
- Start with the context window of prior researchers.
- Set a goal or research direction.
- Engage in chain of thought with occasional reality-testing.
- Generate an output artifact, reviewable by those with appropriate expertise, to allow consensus reality to accept or reject their work.
https://x.com/polynoamial/status/1946478258968531288
"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."
and
"This was a small team effort led by @alexwei_ . He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community."
> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool, and how one reports their results.
> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.
> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".
> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:
> * One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)
> * Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.
> * The team leader gives the students unlimited access to calculators, computer algebra packages, textbooks, or the ability to search the internet.
> * The team leader has the six student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.
> * The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.
> * Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.
> * If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.
> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.
> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.
Source:
https://mathstodon.xyz/@tao/114881418225852441
tester756•10h ago
any details?
littlestymaar•9h ago
I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.
But we must be wary of mixing up smart information retrieval with reasoning.
colinmorelli•4h ago
There are many industries for which the vast majority of work done is closer to what I think you mean by "smart retrieval" than what I think you mean by "reasoning." Adult primary care and pediatrics, finance, law, veterinary medicine, software engineering, etc. At least half, if not upwards of 80% of the work in each of these fields is effectively pattern matching to a known set of protocols. They absolutely deal in novel problems as well, but it's not the majority of their work.
Philosophically it might be interesting to ask what "reasoning" means, and how we can assess if the LLMs are doing it. But, practically, the impacts to society will be felt even if all they are doing is retrieval.
littlestymaar•2h ago
I wholeheartedly agree with that.
I'm in fact pretty bullish on LLMs, as tools with near infinite industrial use cases, but I really dislike the “AGI soon” narrative (which sets expectations way too high).
IMHO the biggest issue with LLMs isn't that they aren't good enough at solving math problems, but that there's no easy way to add information to a model after its training, which is a significant problem for a “smart information retrieval” system. RAG is used as a hack around this issue, but its performance can vary a lot across tasks. LoRAs are another option, but they require significant work to build a dataset, and you can only cross your fingers that the model keeps its abilities.
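As a concrete (hypothetical) illustration of the RAG workaround mentioned above: retrieve relevant text at query time and prepend it to the prompt, instead of changing the model's weights. The `embed` function below is a stand-in for whatever embedding model you would actually use, and the document strings are made up.

```python
# Minimal RAG sketch: bolt retrieval onto the prompt at query time rather than
# adding new knowledge to the weights. `embed` is a placeholder, not a real model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice, call an actual embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# The prompt is then handed to the LLM; the weights never change, which is
# why retrieval quality ends up dominating task performance.
print(build_prompt("Where was IMO 2025 held?",
                   ["IMO 2025 took place on the Sunshine Coast.",
                    "RAG retrieves documents at query time."]))
```

This is also why RAG quality varies so much with the task: everything hinges on whether the retriever surfaces the right passages for that particular query.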
Jcampuzano2•7h ago