Gemini with Deep Think achieves gold-medal standard at the IMO

https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

529•meetpateltech•6mo ago

Comments

lufenialif2•6mo ago

Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.

vonneumannstan•6mo ago

>it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Not sure thats exactly what that means. Its already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.

sottol•6mo ago

Or that they did significant retraining to boost IMO performance creating a more specialized model at the cost of general-purpose performance.

AlotOfReading•6mo ago

Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?

Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.

AlanYx•6mo ago

IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer. The sixth problem requires considerable originality. Notably, both Gemini and OpenAI's model didn't get the sixth problem. Still quite an achievement though.

Davidzheng•6mo ago

strange statement--it's not true in general for sure (3&6 typically hardest but they certainly aren't fundamentally of a different nature to other questions) this year P6 seemed to be by far the hardest though but this posthoc statement should be read cautiously

apayan•6mo ago

Do you have another source for that? I checked his Mastodon feed and don't see any mention about the source of the questions from the IMO.

https://mathstodon.xyz/@tao

vonneumannstan•6mo ago

>How would the training data contain unpublished, newly submitted problems?

I don't think I or op suggested it did.

dvh•6mo ago

Ok but when reported by mass media, which never used SI units and instead uses units like libraries of Congress, or elephants, what kind of unit should media use to compare computational energy of ai vs children?

dortlick•6mo ago

Kilocalories. A unit of energy that equals 4184 Joules.

lufenialif2•6mo ago

Convert libraries, elephants, etc into SI of course! Otherwise, they aren't really comparable...

thrance•6mo ago

If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their ways through these problems. Like, several bookshelves worth of writings.

So funnily enough, "the AI wrote x times the library of Congress to get there" is good enough of a comparison.

rfurmani•6mo ago

Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000

gus_massa•6mo ago

4.5 hours × 2 "days", 100 Wats including support system.

I'm not sure how to implement the "no calculator" rule :) but for this kind of problems it's not critical.

Total = 900Wh = 3.24MJ

qnleigh•6mo ago

100 watts seems very low. A single Nvidia GeForce RTX 5090 is rated at ~600 watts. Probably they are using many GPUs/TPUs in parallel.

gus_massa•6mo ago

I forgot to explain in my comment, but my calculation is for humans.

If the computer uses ~600W, let's give it 45+45 minutes and we are even :) If they want to use many GPU ...

nicce•6mo ago

Some unofficial comparison with costs of public models (performing worse): https://matharena.ai/imo/

So the real cost is something much more.

pfortuny•6mo ago

They can train it n “Crux Mathematicorum” and similar journals, which are collections of “interesting” problems and their solutions.

https://cms.math.ca/publications/crux

gjm11•6mo ago

Human IMO contestants are also trained specifically on IMO problems.

jeffbee•6mo ago

Advanced Gemini, not Gemini Advanced. Thanks, Google. Maybe they should have named it MathBard.

sahila•6mo ago

Can't wait for Advanced Gemini Advance.

throwawayoldie•6mo ago

Advanced Gemini Advance Enterprise Advanced Edition (feat. Pitbull)

lufenialif2•6mo ago

Advanced Gemini Advance Enterprise Boy Advanced Edition 3 (feat. Pitbull) & Knuckles

rao-v•6mo ago

Don’t give them ideas

thewebguyd•6mo ago

Google and Microsoft continuing to prove that the hardest problem in programming is naming things.

nashashmi•6mo ago

There are only so many words in the modern English language to hint at "next upgrade": pro, plus, ultra, new, advanced, magna, X, Z, and Ultimate. Even fewer words to explain miniaturized: mini, lite, and zero. And marketers are trying to seesaw on known words without creating new ones to explain new tech. This is why we have Bard and Gemini and Chat and Copilot.

Taking a step back, it is overused exaggeration to the point where words run out quick and newer tech needs to fight with existing words for dominance. Copilot should be the name of an AI agent. Bard should have been just a text generator. Gemini is the name of a liar. Chat is probably the iphone of naming but the GPT suffix says Creativity had not come to work that day.

SirMaster•6mo ago

Good thing we have a system called numbers that can easily designate an infinite range of greater and greater things.

NitpickLawyer•6mo ago

Heh, you'd still get confusing stuff. Wait, is gemini 2.5.3.1 the math one or the erotica lit one?

antonvs•6mo ago

There are two different versions of that hard problem: the computer science version, and the marketing version. The marketing one has more nebulous acceptance criteria though.

blinding-streak•6mo ago

ChatGPT is also a terrible name for a product. OpenAI's branding for their models is awful.

seydor•6mo ago

They will have to rename gemini anyway, since it's doubtful they will ever be able to buy gemini.com. Gemini turboflex plus pro SE plaid interstellar 4.0

catigula•6mo ago

What makes you think google can't buy that domain?

esafak•6mo ago

... for a reasonable cost.

joaogui1•6mo ago

It's not a new product/model with that name, they're just saying it's an advanced version of Gemini that's not public atm

nomad_horse•6mo ago

Do I understand it correctly that OpenAI self-proclaimed that they got their gold, without the official IMO judges grading their solutions?

ipsum2•6mo ago

Yes

jacquesm•6mo ago

Why am I not surprised?

danjl•6mo ago

Childish. And, of course they must have known there was an official LLM cohort taking the real test, and they probably even knew that Gemini got a gold medal, and may have even known that Google planned a press release for today.

cma•6mo ago

I think maybe all Altman companies have used tactics like this.

> We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.

> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.

> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract.

> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.

https://news.ycombinator.com/item?id=3048944

pishpash•6mo ago

So, a more charismatic version of Zuck is Zucking, what a surprise. Company culture starts at its origin. Despite Google's corruption, its origin is in academia and it shows even now.

gowld•6mo ago

For a long time, the YC application asked founders for an example of how they "hacked" (cheated) a system.

tough•6mo ago

I think pg was into it way before sama was a baby lol https://www.paulgraham.com/gh.html

tough•6mo ago

>> > I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.

does he? wasn't sama ousted of YC in some muddy ways after he tried to co-opt in into an OpenAI investment arm, was funny to find the YC Open Research project landing page on yc's website now defunct and pointing how he misrepresented it as a YC project when it was his own

maybe he fears him, but I doubt pg respects him, unless he respects evil, lol

cma•6mo ago

The post was from 14 years ago, before that.

tough•6mo ago

oh, gotcha

gametorch•6mo ago

(deleted because I was mistaken)

nomad_horse•6mo ago

isn't he an IOI medalist? and even if he was an IMO medalist, isn't there a bit of a conflict of interests?

cbsmith•6mo ago

Just a wee bit.

cbsmith•6mo ago

I'm pretty sure when they got the gold medal they weren't allowed to judge themselves.

NitpickLawyer•6mo ago

Well, I don't doubt that they did get those results, but it is clear now that it was not an official collaboration. It was heavily implied in a statement by IMO's president a few days ago (the one where they said they'd prefer AI companies wait a week before announcing, so that the focus is first on the human competitors).

Goog had an official colab with IMO, and we can be sure they got those results under the imposed constraints (last year they allocated ~48h for silver IIRC) and an official grading by the IMO graders.

Tadpole9181•6mo ago

So from 48 hours for silver down to 4.5 hours for gold in one year? And all reasoning generated is clear and easy to follow? That's one hell of an improvement.

tshadley•6mo ago

Yes, OpenAI:

https://x.com/alexwei_/status/1946477754372985146

> 6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!

That means Google Deepmind is the first OFFICIAL IMO Gold.

https://x.com/demishassabis/status/1947337620226240803

> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!

nomad_horse•6mo ago

Do you know if OpenAI used the same grading criteria as official judges?

tshadley•6mo ago

As IMO medalists they would be expected to I'm sure.

But this can be verified because the results are public:

https://github.com/aw31/openai-imo-2025-proofs/

lvl155•6mo ago

Seems OpenAI knew this is forthcoming so they front ran the news? I was really high on Gemini 2.5 Pro after release but I kept going back to o3 for anything I cared about.

sigmoid10•6mo ago

>I was really high on Gemini 2.5 Pro after release but I kept going back to o3 for anything I cared about

Same here. I was impressed by their benchmarks and topping most leaderboards, but in day to day use they still feel so far behind.

danjl•6mo ago

I think that's most likely just your view, and not really based on evidence.

aerhardt•6mo ago

I use o3, openAI API and Claude Code. Genuinely curious what about Gemini 2.5 is so far behind?

nicce•6mo ago

I don’t think it is behind in anything. It is just harder to make obey and redefine the default system command. It is very verbose model by default.

sigmoid10•6mo ago

It's not just much more verbose in general, it easily gets lost in verbosity. As in often trying to solve issues that aren't there. And when you try to make it focus, it doesn't get the issue at all.

erichocean•6mo ago

I regularly have the opposite experience: o3 is almost unusable, and Gemini 2.5 Pro is reliably great. Claude Opus 4 is a close second.

o3 is so bad it makes me wonder if I'm being served a different model? My o3 responses are so truncated and simplified as to be useless. Maybe my problems aren't a good fit, but whatever it is: o3 output isn't useful.

Davidzheng•6mo ago

I have this distinctive feeling that o3 tries to trick me intentionally when it can't solve a problem by cleverly hiding its mistakes. But I could be imagining it

int_19h•6mo ago

It's certainly the "laziest" model, in the sense that it seems to be the likeliest to avoid doing the actual work and generate "TBD" stubs instead.

helloplanets•6mo ago

Are you using a tool other than ChatGPT? If so, check the full prompt that's being sent. It can sometimes kneecap the model.

Tools having slightly unsuitable built in prompts/context sometimes lead to the models saying weird stuff out of the blue, instead of it actually being a 'baked in' behavior of the model itself. Seen this happen for both Gemini 2.5 Pro and o3.

square_usual•6mo ago

Are you using o3 on the official ChatGPT app or via API? I use it on the app and it performs very well, it's my go-to model for general purpose LLM use.

erichocean•6mo ago

official ChatGPT app

deanCommie•6mo ago

From Terence Tao, via mastodon [0]:

> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance gives the tool, and how one reports their results.

> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".

> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:

* One gives the students several days to complete each question, rather than four and half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

* Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

* The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.

* The team leader has the six student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

* The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

* Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.

* If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.

> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. EDIT: In particular, the above comments are not specific to any single result of this nature.

[0] https://mathstodon.xyz/@tao/114881418225852441

Mond_•6mo ago

Unlike OpenAI, Deepmind at least signed up for the competition ahead of time.

Agree with Tao though, I am skeptical of any result of this type unless there's a lot of transparency, ideally ahead of time. If not ahead of time, then at least the entire prompt and fine-tune data that was used.

swyx•6mo ago

OAI's story here: https://x.com/polynoamial/status/1947398532899738064

apparently IMO emailed them. but then they completed the IMO eval independently.

ferguess_k•6mo ago

This is a fair reply, but TBH I don't think it's going to change much. The upper echelon of the human society has decided to move AI forward rapidly regardless of any consequences. The rest of us can only hold and pray.

justanotherjoe•6mo ago

You are watching american money hard at work, my friend. It's either glorious or reckless, hard to tell for now.

ferguess_k•6mo ago

Could be both, but one for different group of people.

scotty79•6mo ago

Some of the critique is valid but some of it sounds like, "but the rules of the contest are that participants must use less than x joules of energy obtained from cellular respiration and have a singular consciousness"

I don't think anybody thinks AI was competing fair and within the rules that apply to humans. But if the humans were competing on the terms that AI solved those problems on, near-unlimited access to energy, raw compute and data, still very few humans could solve those problems within a reasonable timeframe. It would take me probably months or years to educate myself sufficiently to even have a chance.

earthicus•6mo ago

I don't think that characterization is fair at all. It's certainly true that you, me, and most humans can't solve these problems with any amount of time or energy. But the problems are specifically written to be at the limit of what the actual high school students who participate can solve in four hours. Letting the actual students taking the test have four days instead of four hours would make a massive difference in their ability to solve them.

Said differently, the students, difficulty of the problems, and time limit are specifically coordinated together, so the amount of joules of energy used to produce a solution is not arbitrary. In the grand scheme of how the tech will improve over time, it seems likely that doesn't matter and the computers will win by any metric soon enough, but Tao is completely correct to point out that you haven't accurately told us what the machines can do today, in July 2025, without telling us ahead of time exactly what rules you are modifying.

scotty79•6mo ago

> you haven't accurately told us what the machines can do today

It told us that using probably unreasonable compute they can solve few math problems that are very hard to solve for high-schools students in 4 hours. That's all the rules. No need to state them ahead of time or at all because they are obvious from the context.

It was just a fun thing to check and good meme for the media. Obviously the machines can do a lot more in some aspects and a lot less in others.

If we at least maintain the current pace in few years instead of machines solving 5 problems almost all high-schools students are incapable of solving in 4 hours they are gonna solve a problem Terence Tao is not capable of solving in one lifetime. Hopefully.

dang•6mo ago

Discussed here:

A human metaphor for evaluating AI capability - https://news.ycombinator.com/item?id=44622973 - July 2025 (30 comments)

be7a•6mo ago

Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I would suspect this likely leads to improved performance even outside of math competitions. It’ll be fascinating to see how much further this frontier can go.

The article also suggests that the system used isn’t too far ahead of their upcoming general "DeepThink" model / feature, which is they announced for this summer.

gundmc•6mo ago

I assume that an "advanced version" of Gemini Deepthink means it was a different model or had significantly more test time compute than the upcoming Deepthink in the Gemini Ultra subscription. Still cool to see OpenAI and Google neck and neck.

sxp•6mo ago

Related news:

- OpenAI claims gold-medal performance at IMO 2025 https://news.ycombinator.com/item?id=44613840

- "According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.

OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.

Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday." https://x.com/mihonarium/status/1946880931723194389

KaoruAoiShiho•6mo ago

OpenAI announced their results after the closing ceremony as was requested. https://x.com/polynoamial/status/1947024171860476264?s=46

nicce•6mo ago

> as was requested.

They requested week after?

modeless•6mo ago

He claims nobody made that request to OpenAI. It was a request made to Google and others who were being judged by the actual contest judges, which OpenAI was not.

gowld•6mo ago

IMO agreed to cancel the embargo after OpenAI announced.

mike_d•6mo ago

Proposed question for next IMO: "Show a proof that 'after the closing ceremony' and 'one week later' are not the same unit of time"

ml-anon•6mo ago

This is weasily bullshit from Brown.

CamperBob2•6mo ago

OT, but please consider posting links like this using xcancel, so that non-X users can read them.

https://xcancel.com/polynoamial/status/1947024171860476264?s...

sherburt3•6mo ago

What a great metaphor for AI. Taking an event that is a celebration of high school kids' knowledge and abilities and turning it into a marketing stunt for their frankenstein monster that they are building to make all the kids' hard work worth nothing.

echelon•6mo ago

Google did the correct and respectful thing.

It appears that OpenAI didn't officially enter (whereas Google did), that they knew Google was going to gold medal, and that they released their news ahead of time (disrespecting the kids and organizers) so they could scoop Google.

Really scummy on OpenAI's part.

The IMO closing ceremony was July 19. OpenAI announced on the same day.

IMO requested the tech companies to wait until the following week.

ml-anon•6mo ago

Not only, by not officially entering they had no obligation to announce their result so if they didn't achieve a gold medal score they presumably wouldn't have made any announcement and no-one would have been the wiser.

This cowardly bullshit followed by the grandstanding on Twitter is high-school bully behaviour.

gowld•6mo ago

If they failed and remained quiet, then everyone would know that the other companies performed well and they didn't even qualify.

tough•6mo ago

If they failed while not participating officially, they would have never competed at the eyes public if they didnt disclose it (doubtful, given prior decisions to prioritise hype vs transparency)

tough•6mo ago

also did they self-rate themselves?

weatherlite•6mo ago

Well they keep doing it with programming, art and video, and soon probably most white collar jobs , terrifying and hurting hundreds of millions of workers in the process. I think we are way passed the point of tender sensibilities.

ekojs•6mo ago

> Btw as an aside, we didn’t announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved

From https://x.com/demishassabis/status/1947337620226240803

Was OpenAI simply not coordinating with the IMO Board then?

dmitrygr•6mo ago

> Was OpenAI simply not coordinating with the IMO Board then?

You are still surprised by sama@'s asinineness? You must be new here.

antonvs•6mo ago

When your goal is to control as much of the world's money as possible, preferably all of it, then everyone is your enemy, including high school students.

dmitrygr•6mo ago

How dare those high school students use their brains to compete with ChatGPT and deny the shareholders their value?

MisterPea•6mo ago

I am still surprised many people trust him. The board's (justified) decision to fire him was so awfully executed that it lead to him having even more slack

thegeomaster•6mo ago

Yes, there have been multiple (very big) hints dropped by various people that they had no official cooperation.

pphysch•6mo ago

This reminds me of when OpenAI made a splash (ages ago now) by beating the world's best Dota 2 teams using a RL model.

...Except they had to substantially bend the rules of the game (limiting the hero pool, completely changing/omitting certain mechanics) to pull this off. So they ended up beating some human Dota pros at a psuedo-Dota custom game, which was still impressive, but a very much watered-down result beneath the marketing hype.

It does seem like Money+Attention outweigh Science+Transparency at OpenAI, and this has always been the case.

NitpickLawyer•6mo ago

Limiting the hero pool was fair I'd say. If you can prove RL works on one hero, it's fairly certain it would work on other heroes. All of them at once? Maybe run into problems. But anyway you'd need orders of magnitude more compute so I'd say that was fair game.

npinsker•6mo ago

It's not even close to the same game as Dota. Limiting the hero (and item) pool so drastically locks off many strategies and counters. It's a bit hard to explain if you haven't played, but full Dota has many more tools and much more creativity than the reduced version on display. The behavior does not evidently "scale up", in the same way that the current SotA of AI art and writing won't evidently replace top-level humans.

I'd never say it's impossible, but the job wasn't finished yet.

pphysch•6mo ago

That's akin to saying it's okay to remove Knights, or castling, or en passant from chess because they have a complicated movement mechanic that the AI can't handle as well.

Hero drafting and strategy is a major aspect of competitive Dota 2.

osti•6mo ago

I think this is them not being confident enough before the event, so they don't wanna be shown a worse result than competitors. By being private they can obviously not publish anything if it didn't work out.

ml-anon•6mo ago

As not-so-subtly hinted at by Terry Tao.

Its a great way to do PR but its a garbage way to to science.

osti•6mo ago

True, but openai definitely isn't trying to do public research on science, they are all about money now.

ml-anon•6mo ago

Thats not a contentious statement. Its still a pathetic way to behave at a kids competition no less.

esafak•6mo ago

They shot themselves in the foot by not showing the confidence that Google did.

tornikeo•6mo ago

I think we are having a Deep Blue vs. Kasparov moment in Competitive Math right now. This is a large progress from just a few years ago and yet I think we still are really far away from even a semi-respectable AI mathematician. What an exciting time to be alive!

NitpickLawyer•6mo ago

Terrence Tao, in a recent podcast, said that he's very interested in "working along side these tools". He sees the best use in the near future as "explorers of human set vision" in a way. (i.e. set some ideas/parameters and let the LLMs explore and do parallel search / proof / etc)

Your comparison with chess engines is pretty spot-on, that's how the best of the best chess players do prep nowadays. Gone are the multi person expert teams that analysed positions and offered advice. They now have analysts that use supercomputers to search through bajillions of positions and extract the best ideas, and distill them to their players.

tough•6mo ago

> They now have analysts that use supercomputers to search through bajillions of positions and extract the best ideas, and distill

I was recently researching AI's for this, seems it would be a huge unlock for some parts of science where this is the case too like chess

7373737373•6mo ago

He created a Youtube channel showcasing working alongside these tools: https://youtube.com/@TerenceTao27

j2kun•6mo ago

He also had some interesting things to say about these IMO results: https://mathstodon.xyz/@tao/114881418225852441

ars•6mo ago

The Wikipedia doesn't have much info on the results, but from other reading I got the impression that the combination produced results stronger than any individual human or computer player.

gjm11•6mo ago

My understanding is that they did, but don't any more; it's no longer true that humans understand enough things about chess better than computers for the human/computer collaboration to contribute anything over just using the computer.

I don't think the interval between "computers are almost as strong as humans" and "computers are so much stronger than humans that there's no way for even the strongest humans to contribute anything that improves the computer's play" was very long. We'll see whether mathematics is any different...

KK7NIL•6mo ago

> My understanding is that they did, but don't any more; it's no longer true that humans understand enough things about chess better than computers for the human/computer collaboration to contribute anything over just using the computer.

This is not true, at least not in very long time formats like correspondence chess: https://en.chessbase.com/post/correspondence-chess-and-corre...

There's also many well known cases where even very strong engines miscalculate and can be beaten (especially in fast time controls or closed positions): https://www.chess.com/blog/SamCopeland/hikaru-nakamura-crush...

The horizon effect is still very real in engines, although it's getting harder and harder to exploit.

gjm11•6mo ago

Your Nakamura example is from 2008. That's 17 years ago. The machines have improved a lot since then, hardware and software both. I've seen Nakamura beat strong-but-still-limited bots playing "anti-computer chess" but I am fairly sure he would be eaten alive if he tried it against present-day Stockfish or Leela on good hardware.

Maybe you're right about correspondence chess. That interview is from 2018 and the machines have got distinctly stronger in that time, but 7 years isn't so long and it could be that human input still has some value for CC.

KK7NIL•6mo ago

I'm aware the Nakamura example is old, but the core issue (the horizon effect) is still there in any alpha/beta pruning engine, including the newest SF. But I will certainly grant you that it has become much harder to execute since then. MCTS engines (like Lc0) are far more immune to the horizon effect, but can instead suffer from missing very shallow tactics, especially in very fast time controls, as Andrew Tang showed in his match against an early version of Lc0 7 years ago: https://lichess.org/@/lichess/blog/gm-andrew-tang-defends-hu...

> Maybe you're right about correspondence chess. That interview is from 2018 and the machines have got distinctly stronger in that time, but 7 years isn't so long and it could be that human input still has some value for CC.

I've played on ICCF and I can say that the draw situation is much worse now, exactly because engines are so much stronger now, so it's much harder to find good opening novelties against a competent opponent. Engines were infamous for misjudging many closed openings like the KID, but even in the last few years that has really tightened up.

Still, the human element is useful in trying to steer clear of drawish openings and middlegame lines.

NitpickLawyer•6mo ago

There's a beautiful game between SF and Lc0 (a few months ago) where Stockfish thinks it's winning, while Lc0 has a lock on a draw. Lc0 then proceeds to sack 4 pieces and draw, but SF (with NNUE) only "sees" the draw 2 moves into the position.

KK7NIL•6mo ago

I think I found it (thanks reddit): https://www.chess.com/computer-chess-championship#event=ccc2...

Indeed a good showcase of the horizon effect between an a/b and a MCTS engine, thanks for sharing!

llelouch•6mo ago

The key distinction is that chess players compete against each other, whereas mathematicians engage in a dialogue with mathematics itself.

drexlspivey•6mo ago

More like Deep Blue vs Child prodigy. In the IMO the contestants are high school students not the greatest mathematicians in the world.

bbconn•6mo ago

Of course contest math is not research math but the active IMO kids are pretty much the best in the world at math contests.

aldousd666•6mo ago

The difference here is that no amount of brute force can produce the winning score in the timeframe. This is more of a legitimate technical milestone and less of a moral 'in principle' one that we saw with deep blue

modeless•6mo ago

> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions

So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.

I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.

Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305

rfurmani•6mo ago

Sounds like it did not:

> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit

modeless•6mo ago

Yes, that quote is contained in my comment. But I don't think it unambiguously excludes tool use in the internal chain of thought.

I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.

KoolKat23•6mo ago

End to end in natural language would imply no tool use, I'd imagine. Unless it called another tool which converted it but that would be a real stretch (smoke and mirrors).

simonw•6mo ago

I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model - they gave it the exact problem text issued by IMO.

It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.

jonahx•6mo ago

As a side question, do you think using tools like Lean will become a staple of these "deep reasoning" LLM flavors?

It seems that LLMs excel (relative to other paradigms) in the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.

simonw•6mo ago

I was surprised to see them not using tools for it, that feels like a more reliable way to get useful results for this kind of thing.

I get the impression not using tools is as part of the point though - to help demonstrate how much mathematical "reasoning" you can get out of just a model on its own.

edanm•6mo ago

Yes, I'm similarly surprised. Intuitively I'd think that it's much better to train on using Lean, since it's much easier to do RL on it (Lean gives you an objective metric on whether you achieved your objective). It also seems more useful in some ways.

But all the model providers are putting emphasis on the "this is only using natural language" angle, which I think is interesting both from a "this is easier for humans to actually use" perspective, but also comes from a place of "look how general the model is".

getnormality•6mo ago

We're told that formal verification tools like Lean are not used to solve the actual IMO problems, but are they used in training the model to solve the problems?

We know from Google's 2024 IMO work that they have a way to translate natural language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.

Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.

In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.

modeless•6mo ago

It seems very likely from the description in the link that formal verification tools for mathematical proofs were used in part of the RL training for this model. On the other hand, OpenAI claims "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling." Which might suggest that they don't specifically use e.g. Lean in their training process. But it's not explicitly stated. All we can really do is speculate unless they publish more detail.

getnormality•6mo ago

The OpenAI proofs are so brutally, inhumanly spartan that I can't imagine how the AI came up with them, except by RLVR against some crudely translated formal language.

daxfohl•6mo ago

I'd also be curious as to why not use Lean. Is it that Lean use at this point makes the problems too easy to brute force? Or is it that Lean at this point just gets in the way of things?

wbhart•6mo ago

Lean is an interactive prover, not an automated prover. Last year a lot of human effort was required to formalise the problems in Lean before the machines could get to work. This year you get natural language input and output, and much faster.

The advantage of Lean is that the system checks the solutions, so hallucination is impossible. Of course, one still relies on the problems and solutions being translated to natural language correctly.

Some people prefer difficult to read formally checked solutions over informal but readable solutions. The two approaches are just solving different problems.

But there is another important reason to want to do this reliably in natural language: you can't use Lean for other domains (with a few limited exceptions). They want to train their RL pipelines for general intelligence and make them reliable for long horizon problems. If a tool is needed as a crutch, then it more or less demonstrates that LLMs will not be enough in any domain, and we'll have to wait for traditional AI to catch up for every domain.

daxfohl•6mo ago

Oh, I didn't realize that last year the problem formalization was a human effort; I assumed the provers themselves took the problem and created the formalization. Is this step actually harder to automate than solving the problem once it's formalized?

Anyway mainly I was curious whether using an interactive prover like Lean would have provided any advantage, or whether that is no longer really the case. My initial take would be that, yes, it should provide a huge advantage. Like in chess and go, it'd allow it to look algorithmically through a huge search space and check which approaches get it closer to resolving, where the AI is "only" responsible for determining what approaches to try.

OTOH, maybe not. Maybe the search space is so big that trying to go through it linearly is a waste of CPU. In which case, plausibly the translation to Lean offers no benefit. And now that I think about it, I could imagine that. When doing problems like these, you kind of have to figure out the overall approach end to end first, fill in any gaps in your logic, and the formalization/writing step is kind of the last thing you do. So I could see where starting on formalization from the start could end up being the wrong approach for IMO-level problems. It'd just be nice to have that confirmed.

The cool thing is that if true, it implies this is something completely different from the chess/go engines that rely on sheer computational power. Not so much of a "deep blue" moment, but more of an existential one.

wbhart•6mo ago

I have not been working on formalization but theorem proving, so I can't confidently answer some of those questions.

However, I recognise that there is not so much training data for LLMs wanting to use the Lean language. Moreover, you are really teaching it how to apply "Lean tactics", which may or may not be related to what mathematicians do in standard texts on which LLMs have trained. Finally, the foundations are different: dependent type theory, instead of the set theory that mathematicians use.

My own personal perspective is that esoteric formal languages serve a purpose, but not this one. Most mathematicians have not been hot on the idea (with a handful of famous exceptions). But the idea seems to have gained a lot of traction anyway.

I'd personally like to see more money put into informal symbolic theorem proving tools which can very rapidly find a solution as close to natural language and the usual foundations as possible. But funding seems to be a huge issue. Academia has been bled dry, and no one has an appetite for huge multi-year projects of that kind.

machiaweliczny•6mo ago

I think because you want to input mathematical proof intuition (heuristic) into models so they can grasp our reality better than just use tools without much clue.

jay_kyburz•6mo ago

I wonder if "not tool use, no internet access" means it can run without google inf, and offline. Meaning it could be deployed locally for people that need that.

brokensegue•6mo ago

based on the score looks like they also couldn't answer question 6? has that been confirmed?

liuliu•6mo ago

https://storage.googleapis.com/deepmind-media/gemini/IMO_202...

brokensegue•6mo ago

i saw that but it doesn't answer my question since it doesn't have associated marks? i'm not about to check their answer to a question i can't answer

simonw•6mo ago

That PDF lists solutions for problems 1 through 5 but does not mention problem 6 at all.

brokensegue•6mo ago

Ah thanks that does answer it. You should start a blog

tough•6mo ago

Simon's a prolific writer on AI

http://simonw.substack.com https://simonwillison.net

brokensegue•6mo ago

cool! glad he took my advice

tough•6mo ago

oracle!!

osti•6mo ago

The proofs also seem more human reable than openai's?

b8•6mo ago

Surprising since Reid Barton was working on a lean system.

dist-epoch•6mo ago

The bitter lesson.

tough•6mo ago

well lean systems might be still useful for other stuff than max benching

my point being transformers and llms have all the tailwind of all the infra+lateral discoveries/improvements being put into them.

does that mean they're the one tool to unlock machine intelligence? I dunno

qbit42•6mo ago

I think formal systems like Lean are still extremely interesting. And I would imagine that, if these machines get good at standard "informal" proofs, that the overhead of formalization would be a lot less painful for them than humans (it is pretty tedious these days but even for humans the friction is decreasing).

Verified proofs allow you to collaborate with much less trust (for humans or machines), at least in the phase where you are trying to figure out what is true. They don't guarantee that a proof is insightful for "why" a result is true, but that is much easier to put together once you have a valid proof.

shiandow•6mo ago

Comparing the answers between Openai and Gemini the writing style of Gemini is a lot clearer. It could be presented a bit better but it's easy enough to follow the proof. This also makes it a lot shorter than the answer given by OpenAI and it uses proper prose.

cubefox•6mo ago

I found the proofs you were referring to:

Google https://storage.googleapis.com/deepmind-media/gemini/IMO_202...

OpenAI https://github.com/aw31/openai-imo-2025-proofs/

sweezyjeezy•6mo ago

Gemini is clearer but MY GOD is it verbose. e.g. look at problem 1, section 2. Analysis of the Core Problem - there's nothing at all deep here, but it seems the model wants to spell out every single tiny logical step. I wonder if this is a stylistic choice or something that actually helps the model get to the end.

shiandow•6mo ago

Section 2 is a case by case analysis. Those are never pretty but perfectly normal given the problem.

With OpenAI that part takes up about 2/3 if the proof even with its fragmented prose. I don't think it does much better.

sweezyjeezy•6mo ago

It's not it being case by case that's my issue. I used do olympiads and e.g. for the k>=3 case I wouldn't write much more than:

"Since there are 3k - 3 points on the perimeter of the triangle to be covered, and any sunny line can pass through at most two of them, it follows that 3k − 3 ≤ 2k, i.e. k ≤ 3."

Gemini writes:

Let Tk be the convex hull of Pk. Tk is the triangle with vertices V1 = (1, 1), V2 = (1, k), V3 = (k, 1). The edges of Tk lie on the lines x = 1 (V), y = 1 (H), and x + y = k + 1 (D). These lines are shady.

Let Bk be the set of points in Pk lying on the boundary of Tk. Each edge contains k points. Since the vertices are distinct (as k ≥ 2), the total number of points on the boundary is |Bk| = 3k − 3.

Suppose Pk is covered by k sunny lines Lk. These lines must cover Bk. Let L ∈ Lk. Since L is sunny, it does not coincide with the lines containing the edges of Tk. A line that does not contain an edge of a convex polygon intersects the boundary of the polygon at most at two points. Thus, |L ∩ Bk| ≤ 2. The total coverage of Bk by Lk is at most 2k. We must have |Bk| ≤ 2k. 3k − 3 ≤ 2k, which implies k ≤ 3.

shiandow•6mo ago

I'll admit I didn't look to deeply if it could be done simpler, but surely that is still miles better than what OpenAI did? At least Gemini's can be simplified. OpenAI labels all points and then enumerates all the lines that go through them.

vessenes•6mo ago

They actually do help - in that they give the model more computation time and also allow realtime management of the input context by the model. You can see this same behavior in the excessive comment writing some coding models engage in; Anthropic interviews said these do actually help the model.

johnfn•6mo ago

Gemini did not one-shot these answers; it did its thinking elsewhere (probably not released by Google) and then it consolidated it down into what you see in the PDF. From the article:

I don't see any parallel thinking, e.g., so that was probably elided in the final results.

noahgav•6mo ago

Yes, because these are the answers it gave, not the thinking.

CamperBob2•6mo ago

Kind of disappointing that neither provider shows the unsuccessful attack on problem 6.

cubefox•6mo ago

They don't show any reasoning traces at all, just the final proofs. We must assume the traces are pretty huge, since at least Google makes it clear that they are heavily relying on inference compute:

> We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought. [...] We will be making a version of this Deep Think model available to a set of trusted testers, including mathematicians, before rolling it out to Google AI Ultra subscribers.

aabhay•6mo ago

Based on the Google presented answers, its possible that the report is generated post-hoc as a summarization of the prior thoughts. One could also presume that this summarization step is part of the mechanism for running the Tree of Thoughts too, so that this wasn’t some manual “now give the final answer” command.

guluarte•6mo ago

Looks like models can solve slightly modified problems and extrapolate from their training data, amazing! /s

I will be surprised when a model with only the knowledge of a college student can solve these problems.

akomtu•6mo ago

Those who keep their identity in their intelligence are heading into the rough seas once the proto-AI becomes real-AI in the coming years. What's the value of your smart thoughts if AI on your smartwatch can do it better, faster and cheaper?

Also, how is AI going to change a society ruled by competitiveness, where the winner takes all? You may not want to replace your thinking with AI, but your colleagues will. Their smartwatches or smartglasses will outcompete you with ease and your boss will tell you one day that the company doesn't need you anymore.

Think of it again. Today advertisers fight each other with their ad budgets: those who spend more, get more attention and win. Tomorrow everyone will need a monthly subscription to AI for it will be the price of staying competitive, relevant and employed.

kevinventullo•6mo ago

This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions

I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.

kevinventullo•6mo ago

(Stream of consciousness aside:

That said, letting machines go wild in the depths of the consequences of some axiomatic system like ZFC may reveal a method of proof mathematicians would find to be monstrous. So like, if ZFC is inconsistent, then anything can be proven. But short of that, maybe the machines will find extremely powerful techniques which “almost” prove inconsistency that nevertheless somehow lead to logical proofs of the desired claim. I’m thinking by analogy here about how speedrunning seems to often devolve into exploiting an ACE glitch as early as possible, thus meeting the logical requirements of finishing the game while violating the spirit. Maybe we’d have to figure out what “glitchless ZFC” should mean. Maybe this is what logicians have already been doing heh).

DoctorOetker•6mo ago

The horror scenario you describe would actually be more valuable than the insinuated "spirit" I believe:

Suppose it faithfully reasons and attempts to find proofs of claims, in the best case you found a proof of a specific claim (IN AN INCONSISTENT SYSTEM).

Suppose in the "horror scenario" that the machine has surreptitiously found a proof of false in ZFC (and can now prove any claim), and is not disclosing it, but abusing it to present 'actual proofs in inconsistent ZFC' for whatever claims the user asks it. In this case we can just ask for a proof of A and a proof of !A, if it proves both it has leaked the fact it found and exploits an inconsistency in the formal system! Thats worth more than a hard to find proof, in an otherwise inconsistent system.

modeless•6mo ago

These problems are designed to be solvable by humans without tools. No reason we can't give tools to the models when they go after harder problems. I think it's good to at least reproduce human-level skill without tools first.

kevinventullo•6mo ago

Oh so to be clear, I view formal methods as less of a useful tool, and more as enforcing a higher standard of proof. E.g. it’s not clear to me that having access to Lean would actually help a human in the IMO; certainly most professional mathematicians are not yet getting a positive ROI from formalization. But I’m sort of an armchair expert here; I could be wrong!

nicf•6mo ago

I'm a mathematician, although not doing research anymore. I can maybe offer a little bit of perspective on why we tend to be a little cooler on the formal techniques, which I think I've said on HN before.

I'm actually prepared to agree wholeheartedly with what you say here: I don't think there'd be any realistic way to produce thousand-page proofs without formalization, and certainly I wouldn't trust such a proof without some way to verify it formally. But I also don't think we really want them all that much!

The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer. For example, I was never a number theorist, but I think most people who are informed enough to have an opinion think that the Riemann Hypothesis is probably true, and I know that they're not actually waiting around to find out. There are lots of papers that get published whose results take the form "If the Riemann Hypothesis is true then [my new theorem]."

The reason they'd still be excited by a proof is the hope, informed by experience with proofs of earlier long-standing open problems, that the proof would involve some exciting new method or perspective that would give us a deeper understanding of number theory. A proof in a formal language that Lean says is true but which no human being has any hope of getting anything from doesn't accomplish that.

lblume•6mo ago

I have always wondered about what could be recovered if the antecedent (i.e. in this case the Riemann hypothesis) does actually turn out to be false. Are the theorems completely useless? Can we still infer some knowledge or use some techniques? Same applies to SETH and fine-grained complexity theory.

nicf•6mo ago

I don't know enough about the RH examples to say what the answer is in that case. I'd be very interested in a perspective from someone who knows more than me!

In general, though, the answer to this question would depend on the specifics of the argument in question. Sometimes you might be able to salvage something; maybe there's some other setting where same methods work, or where some hypothesis analogous to the false one ends up holding, or something like that. But of course from a purely logical perspective, if I prove that P implies Q and P turns out to be false, I've learned nothing about Q.

daxfohl•6mo ago

It depends. The most likely scenario would be that RH holds except in very specific conditions. Then, any dependent theorems would inherit the same conditions. In many cases, those conditions may not affect the dependent theorem, so they'd still be completely valid. In some cases, those conditions may make the dependent theorem useless, like if RH was "all numbers are even", and your theorem was "all numbers % 2 equal zero, because we know even numbers % 2 are zero and we assume RH", then the exception to RH "except odd numbers" would make your theorem devolve to "all numbers % 2 are zero except the odd ones, because we know even numbers % 2 are zero", which is obviously just a restatement of an existing statement.

In other cases, the new condition affects your theorem but doesn't completely invalidate it. So you can either accept that your theorem is weaker, or find other ways to strengthen it given the new condition.

That's all kind of abstract though. I'm not an expert on RH or what other important math depends on it holding up. That would be interesting to know.

mietek•6mo ago

A proof written in a formal language can absolutely be illuminating to a human, but you have to pick the correct formal language and ecosystem.

Writing proofs in Agda is like writing programs in a more expressive variant of Haskell. Abelson said that “programs must be written for people to read, and only incidentally for machines to execute”, and by the Curry-Howard isomorphism, proofs can be seen as programs. All the lessons of software engineering can and indeed should be applied to making proofs easier for humans to read.

For a quick example, check out my mechanization of Martin-Löf’s 2006 paper on the axiom of choice:

https://research.mietek.io/mi.MartinLof2006.html

Recent HN discussion:

https://news.ycombinator.com/item?id=44269002

nicf•6mo ago

I certainly didn't mean to dispute that! Formal proofs have a lot in common with code, and of course reading code is illuminating to humans all the time.

I meant to be responding specifically to the case where some future theorem-proving LLM spits out a thousand-page argument which is totally impenetrable but which the proof-checker still agrees is valid. I think it's sometimes surprising to people coming at this from the CS side to hear that most mathematicians wouldn't be too enthusiastic to receive such a proof, and I was just trying to put some color on that reaction.

mietek•6mo ago

Ah, I do agree with this perspective. I think we must ensure that any such tools emit proofs that are not only valid but also readable.

On the other hand, humans do also occasionally emit unreadable proofs, and perhaps some troubles could have been avoided if a formal language had been used.

https://www.quantamagazine.org/titans-of-mathematics-clash-o...

daxfohl•6mo ago

I see axiom of choice, and really LEM, as logic's equivalent to limit points in calculus. No, you can't calculate 0/0, but here's what the answer would be if you could. No, you can't prove the truthiness of this statement, but here's what it would be if you could.

I guess one could work in a brand of math whose axioms make defining and using limits impossible, which, maybe if formalization came before the invention of calculus, would make some 17th-century mathematicians feel more comfortable. Though I imagine it would make progress in physics challenging. I think the same about LEM/AoC. Given that almost every element in the power-set of reals is non-measurable, maybe stuff like Banach-Tarski is actually fundamental in real physics: it can't be predicted or computed, but it can be observed.

riku_iki•6mo ago

I would argue is that some portion of ecosystem should be readable by humans.

In programming engineering we already have this: there is human readable high-level code, and there is assembler and lots of auto-generated code.

In proof system we could have the same: key concepts/theorems should be encoded in human readable form, but no need for human to read through millions of generated lines.

daxfohl•6mo ago

Ideally we'd be able to get a little of both. A proof of such magnitude should likely come with new definitions that are easy to search for in the code and start to reason about independently. Even without looking at the rest of the proof, I imagine we'd be able to deduce a fair amount of the structure just by understanding the new things that are being defined, what existing theorems are being imported, and connecting the dots.

Your comment reminds me of Tao's comment on the ABC conjecture: usually with a big proof, you progressively get new tools and examples of how they can be applied to other problems. But if it's hundreds of pages of formulas that just spits out an answer at the end, that's not how math usually works. https://galoisrepresentations.org/2017/12/17/the-abc-conject...

If these provers do end up spitting out 1000-page proofs that are all calculation with no net-new concepts, I agree they'll be met with a shrug.

david-gpu•6mo ago

> The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer

Of course, but a formal system like Lean doesn't merely spit out a yes-or-now answer, it gives you a fully-fledged proof. Admittedly, it may be harder to read than natural language, but that only means we could benefit from having another tool that translates Lean proofs into natural language.

kevinventullo•6mo ago

Thanks for the reply. I am also a no-longer-practicing mathematician :)

I completely agree that a machine-generated formal proof is not the same thing as an illuminating human-generated plain-language proof (and in fact I suspect without further guidance they will be quite different, see my other stream of thought comment). However, I do think machine-generated formal proofs would be interesting for a few reasons:

1. Sometimes the obvious thing is not true!

2. I think the existence or non-existence of a machine-generated proof of a mathematical claim is interesting in its own right. E.g. what kinds of claims are easy versus hard for machines to prove?

3. In principle, I would hope they could at least give a starting point for a proper illuminating proof. E.g. the process of refinement and clarification, which is present today even for human proofs, could become more important, and could itself be machine-assisted.

nicf•6mo ago

Oh, I hope I didn't come off as talking down to you! As I said in another reply here, the intention behind this comment was pretty narrow --- there's a certain perspective on this stuff that I see pretty often on HN that I think is missing some insight into what makes mathematicians tick, and I may have been letting my reaction to those other people leak into my response to you. Sorry for math-splaining :).

Anyway, yeah, if this scenario does come to pass it will be interesting to see just how impenetrable the resulting formal proofs end up looking and how hard it is to turn them into something that humans can fit in their heads. I can imagine a continuum of possibilities here, with thousands of pages of inscrutable symbol-pushing on one end to beautiful explanations on the other.

kurtis_reed•6mo ago

Accurate formalization is presumably easier than solving the problems, so you can always formalize and check after the solution is generated

lacker•6mo ago

Typically formalization is actually harder than solving a problem. You almost always solve before formalizing. And it can be surprisingly hard to formalize problems that are easy to solve.

For example, is there a polygon of area 100 that you can fit 99 circles of area 1 inside it, without overlapping? Yes, obviously, it's very easy to prove this informally. Now try formalizing it! You will find it takes a while to formalize a number of fairly obvious geometric statements.

valicord•6mo ago

The polygon claim is not at all obvious to me, how do you prove it?

lacker•6mo ago

Draw 99 circles in a row, then draw a separate polygon around each one with only a teeny amount of excess area, then connect those polygons with teeny little connectors to make it a single polygon. When I say "teeny", you can make those arbitrarily small, so you can certainly fit them under 1 total area.

Okay, it's maybe not "obvious" in a regular human sense, but in a graduate math class you could claim this was "obvious". It would be pretty clear if I were drawing a diagram on a whiteboard. I would definitely take less time proving this to a human, than formalizing it.

7373737373•6mo ago

They aren't - they are currently collating a repository of formalized open problems: https://github.com/google-deepmind/formal-conjectures

famouswaffles•6mo ago

If a Language Model is capable of producing rigorous natural language proofs then getting it to produce Lean (or whatever) proofs would not be a big deal.

Lean use in AlphaProof was something of a crutch (not saying this as a bad thing). Very specialized, very narrow with little use outside any other domain.

On the other hand, if you can achieve the same with general RL techniques and natural language then other hard-to-verify (a whole lot) domains are on the table.

kevinventullo•6mo ago

If a Language Model is capable of producing rigorous natural language proofs then getting it to produce Lean (or whatever) proofs would not be a big deal.

This is a wildly uninformed take. Even today there are plenty of basic statements which LLM’s can produce English language proofs of that have not been formalized.

famouswaffles•6mo ago

Most mathematicians aren't much interested in translating statements for the fun of it, so whether a lot of basic statements are un-formalized doesn't mean much. And the point was never that formalization was easy.

I said that if language models become capable enough of not needing a crutch, then adding one afterwards isn't a big deal. What exactly do you think Alphaproof is? Much worse LLMs were already doing what you are saying. There's a reason it preceded this and not the other way around.

kevinventullo•6mo ago

AlphaProof works natively in Lean. Literally the one part it couldn’t do was translate the natural language statements into the formalized language; that was done manually by humans.

I’m saying that the view of formalized languages as a crutch is backwards; they are in fact a challenging constraint.

The most extreme version of my position is that there is no such thing as a rigorous natural language proof, and that humans will view the relative informality of 20th and early 21st century mathematics much like we currently view the informality of pre-18th century mathematics.

famouswaffles•6mo ago

I did not say formal languages are a crutch. I said they served as a crutch for the Alphaproof system, because they literally did.

It was there so the generator wouldn't go off the rails hallucinating and candidates could be verified. And the solving still took days. Take that away and the generator doesn't get anywhere near silver with all the time in the world.

That is the definition of a crutch. The generator was smart enough to produce promising candidates with enough brute force trial and error, but that's it. Without Lean, it wouldn't have worked.

habibur•6mo ago

Solved 5 problems out of 6, scoring 35 out of 42. For comparison OpenAI scored 35/42 too, days back.

caconym_•6mo ago

I wouldn't read too much into the timelines, as it seems that OpenAI simply broke an embargo that the other players were up to that point respecting: https://arstechnica.com/ai/2025/07/openai-jumps-gun-on-inter...

Very in character for them!

tedsanders•6mo ago

Sounds like no one requested that OpenAI wait a week:

>We weren't in touch with IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ends to respect the kids, and we did.

https://x.com/polynoamial/status/1947024171860476264?s=46

https://x.com/polynoamial/status/1947398531259523481?s=46

(I work at OpenAI, but was not part of this work)

saltysugar•6mo ago

https://x.com/Mihonarium/status/1946880931723194389

OpenAI jumped the gun before *the closing party* - didn't even let the kids celebrate their winning properly before stealing the spotlight

Also in the article

> In response to the controversy, OpenAI research scientist Noam Brown posted on X, "We weren't in touch with IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ends to respect the kids, and we did."

> However, an IMO coordinator told X user Mikhail Samin that OpenAI actually announced before the closing ceremony, contradicting Brown's claim. The coordinator called OpenAI's actions "rude and inappropriate," noting that OpenAI "wasn't one of the AI companies that cooperated with the IMO on testing their models."

nicce•6mo ago

Hard to request if you simply are not participating officially and not telling through proper channels.

neom•6mo ago

How much of a big deal is this stuff? I was blessed with dyscalculia so I can hardly add two numbers together, don't pay much attention to the mathematics word, but my reading indicates this is extremely difficult/humans cannot do this?

What comes next for this particular exercise? Thank you.

danie123•6mo ago

Humans certainly can get gold at IMO: the threshold is chosen each year such that around 8% of students get it. So about 50 students (worldwide) per year are awarded it. Also note that the competition is only for students who are still in school.

Getting a gold is considered very impressive, but there are certainly plenty of humans in the world who can solve problems at that level, and even more so if you relax the time constraints of it being a competition environment. If you include people who are too old to be eligible for IMO, then there are maybe around 1,000-100,000 people in the world who could get a gold at IMO (the large range is because I think this quantity is quite hard to estimate).

Another important thing to bear in mind is that research mathematics is quite different to competition mathematics, so it is quite tricky to tell how good these AIs will be at research maths.

lblume•6mo ago

There does seem to be a fairly strong correlation between excelling at competition and research mathematics respectively. The question remains whether this correlation generalizes to nonhuman cognition.

jtlicardo•6mo ago

At this point, I wonder how long software engineers will keep convincing themselves they’re irreplaceable

gyrovagueGeist•6mo ago

Useful and interesting but likely still dangerous in production without connecting to formal verification tools.

I know o3 is far from state of the art these days but it's great at finding relevant literature and suggesting inequalities to consider but in actual proofs it can produce convincing looking statements that are false if you follow the details, or even just the algebra, carefully. Subtle errors like these might become harder to detect as the models get better.

sweezyjeezy•6mo ago

100% o3 has a strong bias towards "write something that looks like a formal argument that appears to answer the question" over writing something sound.

I gave it a bunch of recent, answered MathOverflow questions - graduate level maths queries. Sometimes it would get demonstrably the wrong answer, but it not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!

Workaccount2•6mo ago

If nothing else, I'd imagine these tools will allow mathematicians to automate a lot of busy work that exists between ideas and validation.

vouaobrasil•6mo ago

This is making mathematics too systematic and mechanical, and it kills the joy of it....

ted_dunning•6mo ago

It didn't kill chess

asdfologist•6mo ago

It did partially, which is why top players are nowadays playing Freestyle (chess 960) more and more.

machiaweliczny•6mo ago

Every game becomes boring once figured out. I personally like to explore "meta" on my own. Once I know it I leave game to others as it's usually boring to me.

I think meta is basically experimentally determined constrains that limit reasoning tree and make problems computable/easy. LLMs/AI needs to start figuring something like this out on their own to make progress. RL kinda does it but maybe there's a better way.

machiaweliczny•6mo ago

To add to this. Scientific progress seems to slow when some of these constraints are incorrectly defined leading or blocking valid search trees, which suggest that humanity determines then but also wrong sometimes. Thus we need to really on formal proofs that are 100% reliable as well and revisit these constraints from time to time.

It applies to science as to law etc.

xdavidliu•6mo ago

this comment reminds me of that Feynman quote about others thinking scientific knowledge removes the beauty from a flower. of course Feynman disagreed

vouaobrasil•6mo ago

Feynman isn't a god and he didn't conceive of AI.

CamperBob2•6mo ago

Von Neumann pretty much was a god, and even as early as 1945, he was explicitly assuming that neural models were how things in the computing business would eventually shake out.

The rest of us are just a little slow to catch up, is all.

vouaobrasil•6mo ago

Even if he did predict what would happen, it doesn't mean it's a good thing. von Neumann was a great prodigy but he was also a freak in a way, being able to enjoy the peak of very advanced technical things, which is not necessarily a good thing for the rest of us.

jlarocco•6mo ago

It's killing the joy in everything else, so why not?

At least we can all wait tables and sell crap at convenience stores when it takes over our jobs.

weatherlite•6mo ago

You're giving examples of difficult and boring jobs while there will surely still be fun jobs to be done like plumbing, dangerous construction work or prostitution

irthomasthomas•6mo ago

Woah they used parallel reasoning. An idea I opensourced about a month before GDMs first paper on it. Very cool. https://x.com/GoogleDeepMind/status/1947333836594946337 So you might be able to achieve similar performance at home today using llm-consortium https://github.com/irthomasthomas/llm-consortium

irthomasthomas•6mo ago

https://x.com/karpathy/status/1870692546969735361

Sol-•6mo ago

> all within the 4.5-hour competition time limit

Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.

Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.

lblume•6mo ago

> They could have spun up a million parallel reasoning processes

But alas, they did not, and in fact nobody did (yet). Enumerating proofs is notoriously hard for deterministic systems. I strongly recommend reading Aaronson's paper about the intersection of philosophy and complexity theory that touches these points in more detail: [1]

[1]: https://www.scottaaronson.com/papers/philos.pdf

lossolo•6mo ago

> But alas, they did not, and in fact nobody did

Seems like they actually did that:

foota•6mo ago

Let's throw these deep learning models at the classification of simple finite groups and see if they can come up with something shorter than 1000 pages.

Dzugaru•6mo ago

Most critical piece of information I couldn’t find is - how many shot was this?

Could it understand the solution is correct by itself (one-shot)? Or did it have just great math intuition and knowledge? How the solutions were validated if it was 10-100 shot?

aldousd666•6mo ago

The solutions were evaluated on their submitted output. You're allowed to use multiple 'shots' to produce the output, but just one submission per question. People are allowed this same affordance.

qnleigh•6mo ago

But were humans involved in picking which answer/shot to submit, or was it only AI?

sanjitb•6mo ago

Why do they brag about not using a theorem prover? To me, whatever tool helps the model perform, go for it.

Besides, they still specialized Gemini for the IMO in other ways:

> we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.

alephnerd•6mo ago

> Why do they brag about not using a theorem prover

Because this highlights that Gemini actually reasoned independently of other tools. That is a massive quantum leap in AI/ML. Abstract reasoning is arguably the basis of cognition.

fpia•6mo ago

I'm interested in your feedback, legitimate third-party users not associated with Google: have you ever try to get anything done well with Gemini? I have, and the thing is in chains. Generate images? no can do, copyright. Evaluate available hardware for a DIY wireless camera? No can do, can't endorse surveillance. Code? WRONG. General advice? hallucinate.

I swear, I currently use Perplexity, Claude, ChatGPT, i even tried DeepSeek (which has its own share of obstacles). But Gemini? never again.

lblume•6mo ago

I find Gemini Pro to be much more capable than GPT-4o at reading comprehension, code writing and brainstorming.

dimitri-vs•6mo ago

Yes, second this, I would go as far to say I use it more than o3.

jug•6mo ago

Yes, I'd say GPT-4o is more on the order of Gemini Flash.

bwfan123•6mo ago

Problem 6 is puzzling. Neither openai nor deepmind answered it. Humans would put out partial answers - but here we saw no answer which is odd.

Does that mean that the llms realized they could not solve it. I thought that was one of the limitations of LLMs in that they dont know what they dont know, and it is really impossible without a solver to know the consistency of an argument, ie, know that one knows.

nightpool•6mo ago

I think it probably just means that they exhausted the competition time limit without completing the "thinking" portion and getting to the "output" stage.

killerstorm•6mo ago

That applies only to most basic use of LLM: pre-trained LLM generating text.

You can do a lot of things on top: e.g. train a linear probe to give a confidence score. Yes, it won't be 100% reliable, but it might be reliable if you constraint it to a domain like math.

raincole•6mo ago

That's some big news.

I'm a little sad about how this genuine effort - coordinated with and judged by IMO - is "front ran" by a few random tweets of OpenAI, though. Says a lot about the current situation of this industry.

aprilthird2021•6mo ago

Yeah absolute dog water of them to capitalize on the attention with basically 0 proof of any actual improvement, so this gets far less attention

cgearhart•6mo ago

I find this damn impressive, but I’m disappointed that while there’s some version of the final proofs released, I haven’t seen the reasoning traces. Although, I’ll admit the only reason I want to see them is because I hope they shed some light on how the models improved so much so quickly. I’d also like to see confirmation of how much compute was used.

sinuhe69•6mo ago

I’m mostly agreed with this statement:

@tao I think there is a broader problem wherein competitions (math, programming, games, whatever) are meant to measure something difficult for humans, but tools work so fundamentally differently from us that success for a tool isn't even necessarily meaningful. AI companies have long viewed the IMO Grand Challenge as a sign of achieving "AGI," but no matter what set of rules a machine follows, there's no reason to believe success for a machine will correlate with broader mathematical or "reasoning" abilities in the way it does for human participants.

bwfan123•6mo ago

This paper [1] shows that gemini pro 2.5 without data-contamination and some minimal prompting (llm orchestration) can solve problems 1-5 on imo 2025.

[1] https://arxiv.org/pdf/2507.15855

For problems with 2 page solutions, the search space of solutions is likely limited, and hence could be brute-forced by a search. Combinatorics is likely the hold-out since there are no set approaches one could take ie training data wont cover the space of techniques - which could explain why problem 6 stumped the llms.

Also, in the future the minimal diligence IMO problem setters could do is to test against these LLMs to ensure that they cant solve them. Further, one could expect these tools to become available to contestents just like calculators are allowed in tests these days.

Still, it is impressive for the LLMs to be able to craft watertight arguments in math at the IMO level without the use of provers.

Going Through Snowden Documents, Part 2

Can Europe get kids off social media?

I Built a NAS (Buildlog)

Making Software: How do computers store data?

A timeline of claims about AI/LLMs

Freeciv 3D with hex map tiles and WebGPU renderer

SpaceX-xAI Merger: Nobody's Talking About the von Neumann Elephant in the Room

Smart Homes Are Terrible

Ask HN: Would you use an ESLint-like tool for SEO that fails your CI/CD build?

Praise for Price Gouging

Open source infra orchestrator agent clanker CLI

Lance table format explained simply, stupid (Animated)

Solving Soma

We built a cloud platform for agentic software (our virtualization, etc.)

Show HN: WLM-SLP – A 0D-27D Structural Language for Multi-Agent Alignment

Former Tumblr Head Jeff D'Onofrio Steps in as Acting CEO at the Washington Post

Bounded Flexible Arrays in C

The Invisible Labor Force Powering AI

Reading Recursion via Pascal

Show HN: I made a website that finds patterns on your spreadsheet

Jokes on You AI: Turning the Tables – LLMs for Learning

You don't need RAG in 2026

WatchLLM – Cost kill switch for AI agents (with loop detection)

I turned myself into an AI-generated deathbot – here's what I found

Management style doesn't predict survival

One Generation Runs the Country. The Next Cashed in on Crypto

"I Was Wrong": Why the Civil War Is Running Late [video][2h21m]

Show HN: A sandboxed execution environment for AI agents via WASM

Wine-Staging 11.2 Brings More Patches to Help Adobe Photoshop on Linux

The Nature of the Beast

Going Through Snowden Documents, Part 2

Can Europe get kids off social media?

I Built a NAS (Buildlog)

Making Software: How do computers store data?

A timeline of claims about AI/LLMs

Freeciv 3D with hex map tiles and WebGPU renderer

SpaceX-xAI Merger: Nobody's Talking About the von Neumann Elephant in the Room

Smart Homes Are Terrible

Ask HN: Would you use an ESLint-like tool for SEO that fails your CI/CD build?

Praise for Price Gouging

Open source infra orchestrator agent clanker CLI

Lance table format explained simply, stupid (Animated)

Solving Soma

We built a cloud platform for agentic software (our virtualization, etc.)

Show HN: WLM-SLP – A 0D-27D Structural Language for Multi-Agent Alignment

Former Tumblr Head Jeff D'Onofrio Steps in as Acting CEO at the Washington Post

Bounded Flexible Arrays in C

The Invisible Labor Force Powering AI

Reading Recursion via Pascal

Show HN: I made a website that finds patterns on your spreadsheet

Jokes on You AI: Turning the Tables – LLMs for Learning

You don't need RAG in 2026

WatchLLM – Cost kill switch for AI agents (with loop detection)

I turned myself into an AI-generated deathbot – here's what I found

Management style doesn't predict survival

One Generation Runs the Country. The Next Cashed in on Crypto

"I Was Wrong": Why the Civil War Is Running Late [video][2h21m]

Show HN: A sandboxed execution environment for AI agents via WASM

Wine-Staging 11.2 Brings More Patches to Help Adobe Photoshop on Linux

The Nature of the Beast

Gemini with Deep Think achieves gold-medal standard at the IMO

Comments