Taking a step back, it's overused exaggeration to the point where words run out quickly and newer tech has to fight with existing words for dominance. Copilot should be the name of an AI agent. Bard should have been just a text generator. Gemini is the name of a liar. "Chat" is probably the iPhone of naming, but the GPT suffix says creativity had not come to work that day.
> We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a term sheet from the company we were trying to sign up. It was real serious.
> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.
> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract.
> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.
Does he? Wasn't sama ousted from YC in some muddy ways after he tried to co-opt it into an OpenAI investment arm? It was funny to find the YC Open Research project landing page on YC's website now defunct, pointing to how he misrepresented it as a YC project when it was his own.
maybe he fears him, but I doubt pg respects him, unless he respects evil, lol
Goog had an official collab with the IMO, and we can be sure they got those results under the imposed constraints (last year they allocated ~48h for silver, IIRC) and with an official grading by the IMO graders.
https://x.com/alexwei_/status/1946477754372985146
> 6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
That means Google Deepmind is the first OFFICIAL IMO Gold.
https://x.com/demishassabis/status/1947337620226240803
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
But this can be verified because the results are public:
Same here. I was impressed by their benchmarks and topping most leaderboards, but in day to day use they still feel so far behind.
o3 is so bad it makes me wonder if I'm being served a different model? My o3 responses are so truncated and simplified as to be useless. Maybe my problems aren't a good fit, but whatever it is: o3 output isn't useful.
Tools having slightly unsuitable built-in prompts/context sometimes leads to the models saying weird stuff out of the blue, rather than it actually being a 'baked in' behavior of the model itself. I've seen this happen for both Gemini 2.5 Pro and o3.
> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance the tool is given, and how one reports their results.
> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.
> The IMO is widely regarded as a highly selective measure of mathematical achievement; it is a significant feat for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score. This year the threshold for gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".
> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:
* One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)
* Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.
* The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.
* The team leader has the six-student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.
* The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.
* Each of the six students on the team submits solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.
* If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.
> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.
> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.
> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. EDIT: In particular, the above comments are not specific to any single result of this nature.
Agree with Tao though, I am skeptical of any result of this type unless there's a lot of transparency, ideally ahead of time. If not ahead of time, then at least the entire prompt and fine-tune data that was used.
Apparently the IMO emailed them, but then they completed the IMO eval independently.
I don't think anybody thinks AI was competing fairly and within the rules that apply to humans. But if the humans were competing on the terms that the AI solved those problems on (near-unlimited access to energy, raw compute, and data), still very few humans could solve those problems within a reasonable timeframe. It would take me probably months or years to educate myself sufficiently to even have a chance.
Said differently, the students, difficulty of the problems, and time limit are specifically coordinated together, so the amount of joules of energy used to produce a solution is not arbitrary. In the grand scheme of how the tech will improve over time, it seems likely that doesn't matter and the computers will win by any metric soon enough, but Tao is completely correct to point out that you haven't accurately told us what the machines can do today, in July 2025, without telling us ahead of time exactly what rules you are modifying.
A human metaphor for evaluating AI capability - https://news.ycombinator.com/item?id=44622973 - July 2025 (30 comments)
The article also suggests that the system used isn’t too far ahead of their upcoming general "DeepThink" model / feature, which they announced for this summer.
- OpenAI claims gold-medal performance at IMO 2025 https://news.ycombinator.com/item?id=44613840
- "According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.
According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.
OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.
Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday." https://x.com/mihonarium/status/1946880931723194389
They requested a week after?
https://xcancel.com/polynoamial/status/1947024171860476264?s...
It appears that OpenAI didn't officially enter (whereas Google did), that they knew Google was going to gold medal, and that they released their news ahead of time (disrespecting the kids and organizers) so they could scoop Google.
Really scummy on OpenAI's part.
The IMO closing ceremony was July 19. OpenAI announced on the same day.
IMO requested the tech companies to wait until the following week.
This cowardly bullshit followed by the grandstanding on Twitter is high-school bully behaviour.
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
From https://x.com/demishassabis/status/1947337620226240803
Was OpenAI simply not coordinating with the IMO Board then?
You are still surprised by sama@'s asinineness? You must be new here.
...Except they had to substantially bend the rules of the game (limiting the hero pool, completely changing/omitting certain mechanics) to pull this off. So they ended up beating some human Dota pros at a pseudo-Dota custom game, which was still impressive, but a very much watered-down result beneath the marketing hype.
It does seem like Money+Attention outweigh Science+Transparency at OpenAI, and this has always been the case.
I'd never say it's impossible, but the job wasn't finished yet.
Hero drafting and strategy is a major aspect of competitive Dota 2.
It's a great way to do PR, but it's a garbage way to do science.
Your comparison with chess engines is pretty spot-on, that's how the best of the best chess players do prep nowadays. Gone are the multi person expert teams that analysed positions and offered advice. They now have analysts that use supercomputers to search through bajillions of positions and extract the best ideas, and distill them to their players.
I was recently researching AIs for this; it seems it would be a huge unlock for some parts of science where this is the case too, like chess.
The Wikipedia article doesn't have much info on the results, but from other reading I got the impression that the combination produced results stronger than any individual human or computer player.
I don't think the interval between "computers are almost as strong as humans" and "computers are so much stronger than humans that there's no way for even the strongest humans to contribute anything that improves the computer's play" was very long. We'll see whether mathematics is any different...
This is not true, at least not in very long time formats like correspondence chess: https://en.chessbase.com/post/correspondence-chess-and-corre...
There's also many well known cases where even very strong engines miscalculate and can be beaten (especially in fast time controls or closed positions): https://www.chess.com/blog/SamCopeland/hikaru-nakamura-crush...
The horizon effect is still very real in engines, although it's getting harder and harder to exploit.
Maybe you're right about correspondence chess. That interview is from 2018 and the machines have got distinctly stronger in that time, but 7 years isn't so long and it could be that human input still has some value for CC.
So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.
I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.
Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit
I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.
It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.
It seems that LLMs excel (relative to other paradigms) in the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.
I get the impression not using tools is as part of the point though - to help demonstrate how much mathematical "reasoning" you can get out of just a model on its own.
We know from Google's 2024 IMO work that they have a way to translate natural language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.
Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.
In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.
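To make the idea concrete, here is a minimal sketch of what such a denser, verification-based reward could look like. The `formalize` and `verify` callables are hypothetical stand-ins for an autoformalizer and a Lean checker; none of this reflects DeepMind's actual (unpublished) pipeline.

```python
from typing import Callable, Optional

# Sketch of an RLVR-style dense reward: score each reasoning step by
# autoformalizing it and running a formal checker, instead of only
# rewarding a fully correct final proof (which is rarely reached on
# IMO-level problems during training).

def step_reward(
    step: str,
    context: str,
    formalize: Callable[[str, str], Optional[str]],  # NL step -> Lean snippet, or None
    verify: Callable[[str], bool],                    # Lean snippet -> checker verdict
) -> float:
    formal = formalize(step, context)
    if formal is None:
        return -0.1   # discourage reasoning that cannot even be formalized
    return 1.0 if verify(formal) else -1.0

def trajectory_reward(steps, context, formalize, verify) -> float:
    # Dense signal: the sum of per-step scores over the whole reasoning trace.
    return sum(step_reward(s, context, formalize, verify) for s in steps)
```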
The advantage of Lean is that the system checks the solutions, so hallucination is impossible. Of course, one still relies on the problems and solutions being translated to natural language correctly.
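A toy Lean illustration of that point (my own minimal example, not from either system): a correct proof term type-checks, while a hallucinated one is rejected outright rather than passing as plausible prose.

```lean
-- Lean accepts this because the proof term genuinely type-checks:
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A "hallucinated" proof cannot slip through: uncommenting the line below
-- yields a type error at check time instead of a convincing wrong proof.
-- theorem bogus (a b : Nat) : a + b = a * b := Nat.add_comm a b
```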
Some people prefer difficult to read formally checked solutions over informal but readable solutions. The two approaches are just solving different problems.
But there is another important reason to want to do this reliably in natural language: you can't use Lean for other domains (with a few limited exceptions). They want to train their RL pipelines for general intelligence and make them reliable for long horizon problems. If a tool is needed as a crutch, then it more or less demonstrates that LLMs will not be enough in any domain, and we'll have to wait for traditional AI to catch up for every domain.
Anyway mainly I was curious whether using an interactive prover like Lean would have provided any advantage, or whether that is no longer really the case. My initial take would be that, yes, it should provide a huge advantage. Like in chess and go, it'd allow it to look algorithmically through a huge search space and check which approaches get it closer to resolving, where the AI is "only" responsible for determining what approaches to try.
OTOH, maybe not. Maybe the search space is so big that trying to go through it linearly is a waste of CPU. In which case, plausibly the translation to Lean offers no benefit. And now that I think about it, I could imagine that. When doing problems like these, you kind of have to figure out the overall approach end to end first, fill in any gaps in your logic, and the formalization/writing step is kind of the last thing you do. So I could see where starting on formalization from the start could end up being the wrong approach for IMO-level problems. It'd just be nice to have that confirmed.
The cool thing is that if true, it implies this is something completely different from the chess/go engines that rely on sheer computational power. Not so much of a "deep blue" moment, but more of an existential one.
My point being that transformers and LLMs have the tailwind of all the infra and lateral discoveries/improvements being put into them.
Does that mean they're the one tool to unlock machine intelligence? I dunno.
Google https://storage.googleapis.com/deepmind-media/gemini/IMO_202...
With OpenAI that part takes up about 2/3 of the proof, even with its fragmented prose. I don't think it does much better.
"Since there are 3k - 3 points on the perimeter of the triangle to be covered, and any sunny line can pass through at most two of them, it follows that 3k − 3 ≤ 2k, i.e. k ≤ 3."
Gemini writes:
Let Tk be the convex hull of Pk. Tk is the triangle with vertices V1 = (1, 1), V2 = (1, k), V3 = (k, 1). The edges of Tk lie on the lines x = 1 (V), y = 1 (H), and x + y = k + 1 (D). These lines are shady.
Let Bk be the set of points in Pk lying on the boundary of Tk. Each edge contains k points. Since the vertices are distinct (as k ≥ 2), the total number of points on the boundary is |Bk| = 3k − 3.
Suppose Pk is covered by k sunny lines Lk. These lines must cover Bk. Let L ∈ Lk. Since L is sunny, it does not coincide with the lines containing the edges of Tk. A line that does not contain an edge of a convex polygon intersects the boundary of the polygon at most at two points. Thus, |L ∩ Bk| ≤ 2. The total coverage of Bk by Lk is at most 2k. We must have |Bk| ≤ 2k. 3k − 3 ≤ 2k, which implies k ≤ 3.
> We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought.
I don't see any parallel thinking in the published solutions, so that was probably elided from the final results.
> We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought. [...] We will be making a version of this Deep Think model available to a set of trusted testers, including mathematicians, before rolling it out to Google AI Ultra subscribers.
I will be surprised when a model with only the knowledge of a college student can solve these problems.
Also, how is AI going to change a society ruled by competitiveness, where the winner takes all? You may not want to replace your thinking with AI, but your colleagues will. Their smartwatches or smartglasses will outcompete you with ease and your boss will tell you one day that the company doesn't need you anymore.
Think of it again. Today advertisers fight each other with their ad budgets: those who spend more, get more attention and win. Tomorrow everyone will need a monthly subscription to AI for it will be the price of staying competitive, relevant and employed.
I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.
That said, letting machines go wild in the depths of the consequences of some axiomatic system like ZFC may reveal a method of proof mathematicians would find to be monstrous. So like, if ZFC is inconsistent, then anything can be proven. But short of that, maybe the machines will find extremely powerful techniques which “almost” prove inconsistency that nevertheless somehow lead to logical proofs of the desired claim. I’m thinking by analogy here about how speedrunning seems to often devolve into exploiting an ACE glitch as early as possible, thus meeting the logical requirements of finishing the game while violating the spirit. Maybe we’d have to figure out what “glitchless ZFC” should mean. (Maybe this is what logicians have already been doing, heh.)
Suppose it faithfully reasons and attempts to find proofs of claims; in the best case you have found a proof of a specific claim (IN AN INCONSISTENT SYSTEM).
Suppose in the "horror scenario" that the machine has surreptitiously found a proof of false in ZFC (and can now prove any claim), and is not disclosing it, but abusing it to present 'actual proofs in inconsistent ZFC' for whatever claims the user asks it. In this case we can just ask for a proof of A and a proof of !A; if it proves both, it has leaked the fact that it found, and exploits, an inconsistency in the formal system! That's worth more than a hard-to-find proof in an otherwise inconsistent system.
I'm actually prepared to agree wholeheartedly with what you say here: I don't think there'd be any realistic way to produce thousand-page proofs without formalization, and certainly I wouldn't trust such a proof without some way to verify it formally. But I also don't think we really want them all that much!
The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer. For example, I was never a number theorist, but I think most people who are informed enough to have an opinion think that the Riemann Hypothesis is probably true, and I know that they're not actually waiting around to find out. There are lots of papers that get published whose results take the form "If the Riemann Hypothesis is true then [my new theorem]."
The reason they'd still be excited by a proof is the hope, informed by experience with proofs of earlier long-standing open problems, that the proof would involve some exciting new method or perspective that would give us a deeper understanding of number theory. A proof in a formal language that Lean says is true but which no human being has any hope of getting anything from doesn't accomplish that.
In general, though, the answer to this question would depend on the specifics of the argument in question. Sometimes you might be able to salvage something; maybe there's some other setting where same methods work, or where some hypothesis analogous to the false one ends up holding, or something like that. But of course from a purely logical perspective, if I prove that P implies Q and P turns out to be false, I've learned nothing about Q.
Writing proofs in Agda is like writing programs in a more expressive variant of Haskell. Abelson said that “programs must be written for people to read, and only incidentally for machines to execute”, and by the Curry-Howard isomorphism, proofs can be seen as programs. All the lessons of software engineering can and indeed should be applied to making proofs easier for humans to read.
For a quick example, check out my mechanization of Martin-Löf’s 2006 paper on the axiom of choice:
https://research.mietek.io/mi.MartinLof2006.html
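For readers who haven't seen Curry-Howard in action, here is a toy illustration (in Lean rather than Agda, and not taken from the linked mechanization): the very same term is both an ordinary program and a proof.

```lean
-- As a program: a two-argument function that returns its first argument.
def constFn (α β : Type) (a : α) (_ : β) : α := a

-- As a proof: the identical term proves the tautology A → (B → A),
-- because propositions are types and proofs are the programs inhabiting them.
theorem constProp (A B : Prop) (a : A) (_ : B) : A := a
```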
Recent HN discussion:
I meant to be responding specifically to the case where some future theorem-proving LLM spits out a thousand-page argument which is totally impenetrable but which the proof-checker still agrees is valid. I think it's sometimes surprising to people coming at this from the CS side to hear that most mathematicians wouldn't be too enthusiastic to receive such a proof, and I was just trying to put some color on that reaction.
On the other hand, humans do also occasionally emit unreadable proofs, and perhaps some troubles could have been avoided if a formal language had been used.
https://www.quantamagazine.org/titans-of-mathematics-clash-o...
Your comment reminds me of Tao's comment on the ABC conjecture: usually with a big proof, you progressively get new tools and examples of how they can be applied to other problems. But if it's hundreds of pages of formulas that just spits out an answer at the end, that's not how math usually works. https://galoisrepresentations.org/2017/12/17/the-abc-conject...
If these provers do end up spitting out 1000-page proofs that are all calculation with no net-new concepts, I agree they'll be met with a shrug.
Of course, but a formal system like Lean doesn't merely spit out a yes-or-no answer, it gives you a fully-fledged proof. Admittedly, it may be harder to read than natural language, but that only means we could benefit from having another tool that translates Lean proofs into natural language.
I completely agree that a machine-generated formal proof is not the same thing as an illuminating human-generated plain-language proof (and in fact I suspect without further guidance they will be quite different, see my other stream of thought comment). However, I do think machine-generated formal proofs would be interesting for a few reasons:
1. Sometimes the obvious thing is not true!
2. I think the existence or non-existence of a machine-generated proof of a mathematical claim is interesting in its own right. E.g. what kinds of claims are easy versus hard for machines to prove?
3. In principle, I would hope they could at least give a starting point for a proper illuminating proof. E.g. the process of refinement and clarification, which is present today even for human proofs, could become more important, and could itself be machine-assisted.
Anyway, yeah, if this scenario does come to pass it will be interesting to see just how impenetrable the resulting formal proofs end up looking and how hard it is to turn them into something that humans can fit in their heads. I can imagine a continuum of possibilities here, with thousands of pages of inscrutable symbol-pushing on one end to beautiful explanations on the other.
For example, is there a polygon of area 100 into which you can fit 99 circles of area 1, without overlapping? Yes, obviously, and it's very easy to prove this informally. Now try formalizing it! You will find it takes a while to formalize a number of fairly obvious geometric statements.
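To give a flavor of how much has to be pinned down, here is a rough Lean sketch of just the statement. `IsPolygon`, `area`, and `Disk` are deliberately left as assumed ingredients (to my knowledge there is no ready-made polygon predicate to reach for), which is exactly where the formalization effort would go.

```lean
import Mathlib

-- Statement only, no proof. The abstract ingredients below are assumptions,
-- and giving them real definitions is most of the work being described above.
variable (IsPolygon : Set (ℝ × ℝ) → Prop)
variable (area : Set (ℝ × ℝ) → ℝ)
variable (Disk : (ℝ × ℝ) → ℝ → Set (ℝ × ℝ))  -- center and area ↦ closed disk

def NinetyNineCircles : Prop :=
  ∃ (P : Set (ℝ × ℝ)) (c : Fin 99 → ℝ × ℝ),
    IsPolygon P ∧ area P = 100 ∧
    (∀ i : Fin 99, Disk (c i) 1 ⊆ P) ∧
    (∀ i j : Fin 99, i ≠ j → Disk (c i) 1 ∩ Disk (c j) 1 = ∅)
```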
Lean use in AlphaProof was something of a crutch (not saying this as a bad thing). Very specialized, very narrow with little use outside any other domain.
On the other hand, if you can achieve the same with general RL techniques and natural language then other hard-to-verify (a whole lot) domains are on the table.
This is a wildly uninformed take. Even today there are plenty of basic statements for which LLMs can produce English-language proofs but which have not been formalized.
I said that if language models become capable enough to not need a crutch, then adding one afterwards isn't a big deal. What exactly do you think AlphaProof is? Much worse LLMs were already doing what you are saying. There's a reason it preceded this and not the other way around.
Very in character for them!
>We weren't in touch with IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ends to respect the kids, and we did.
https://x.com/polynoamial/status/1947024171860476264?s=46
https://x.com/polynoamial/status/1947398531259523481?s=46
(I work at OpenAI, but was not part of this work)
What comes next for this particular exercise? Thank you.
Getting a gold is considered very impressive, but there are certainly plenty of humans in the world who can solve problems at that level, and even more so if you relax the time constraints of it being a competition environment. If you include people who are too old to be eligible for IMO, then there are maybe around 1,000-100,000 people in the world who could get a gold at IMO (the large range is because I think this quantity is quite hard to estimate).
Another important thing to bear in mind is that research mathematics is quite different to competition mathematics, so it is quite tricky to tell how good these AIs will be at research maths.
I know o3 is far from state of the art these days. It's great at finding relevant literature and suggesting inequalities to consider, but in actual proofs it can produce convincing-looking statements that are false if you follow the details, or even just the algebra, carefully. Subtle errors like these might become harder to detect as the models get better.
I gave it a bunch of recent, answered MathOverflow questions - graduate-level maths queries. Sometimes it would get demonstrably the wrong answer, but it would not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!
At least we can all wait tables and sell crap at convenience stores when it takes over our jobs.
Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.
Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.
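For what it's worth, the selection loop being described is conceptually simple; the expensive part is the generation itself. A rough best-of-n sketch, with `generate` and `score` as injected stand-ins (this does not reflect either lab's actual system):

```python
import concurrent.futures
from typing import Callable, Optional

# Best-of-n selection: sample many candidate proofs in parallel, then let an
# evaluator pick at most one to submit.

def best_of_n(
    problem: str,
    generate: Callable[[str, int], str],   # (problem, seed) -> candidate proof
    score: Callable[[str, str], float],    # (problem, proof) -> evaluator score
    n: int = 64,
    workers: int = 8,
    threshold: float = 0.5,
) -> Optional[str]:
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(pool.map(lambda seed: generate(problem, seed), range(n)))
    best_score, best_proof = max((score(problem, c), c) for c in candidates)
    # Mirrors the "silent withdrawal" format change Tao describes: if the
    # evaluator is not confident in any candidate, nothing is submitted.
    return best_proof if best_score >= threshold else None
```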
But alas, they did not, and in fact nobody did (yet). Enumerating proofs is notoriously hard for deterministic systems. I strongly recommend reading Aaronson's paper about the intersection of philosophy and complexity theory, which touches on these points in more detail: [1]
Could it understand by itself that the solution is correct (one-shot)? Or did it just have great math intuition and knowledge? How were the solutions validated if it was 10-100 shot?
Besides, they still specialized Gemini for the IMO in other ways:
> we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.
Because this highlights that Gemini actually reasoned independently of other tools. That is a massive quantum leap in AI/ML. Abstract reasoning is arguably the basis of cognition.
I swear, I currently use Perplexity, Claude, ChatGPT; I even tried DeepSeek (which has its own share of obstacles). But Gemini? Never again.
Does that mean that the LLMs realized they could not solve it? I thought that was one of the limitations of LLMs: they don't know what they don't know, and it is really impossible without a solver to know the consistency of an argument, i.e., to know that one knows.
You can do a lot of things on top: e.g. train a linear probe to give a confidence score. Yes, it won't be 100% reliable, but it might be reliable enough if you constrain it to a domain like math.
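A minimal sketch of such a probe, assuming you can extract hidden states for each answer and know after the fact whether it was correct (sklearn-based; the arrays here are random placeholders just to show the shape of the idea):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear probe for confidence: fit logistic regression on the model's hidden
# states, labeled by whether each answer turned out to be correct.

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))   # placeholder for real activations
is_correct = rng.integers(0, 2, size=1000)      # placeholder correctness labels

probe = LogisticRegression(max_iter=1000).fit(hidden_states, is_correct)

# At inference time, the probe's predicted probability acts as the confidence score.
confidence = probe.predict_proba(hidden_states[:1])[0, 1]
print(f"estimated P(correct) = {confidence:.2f}")
```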
I'm a little sad about how this genuine effort - coordinated with and judged by the IMO - is "front-run" by a few random tweets from OpenAI, though. Says a lot about the current state of this industry.
Problem 5: Both OpenAI and Gemini pull the constant 1/sqrt(2) out of their ass and proof-seek towards it.
Problem 1: Both OpenAI and Gemini pull out the solution up front and proof-seek towards it. OpenAI pulls a "reduction lemma" out of nowhere at the end.
lufenialif2•10h ago
Something that was hotly debated in the thread with OpenAI's results:
"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."
it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.
Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.
vonneumannstan•9h ago
Not sure that's exactly what that means. It's already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.
AlotOfReading•9h ago
Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.
apayan•8h ago
https://mathstodon.xyz/@tao
vonneumannstan•7h ago
I don't think I or OP suggested it did.
thrance•9h ago
So funnily enough, "the AI wrote X times the Library of Congress to get there" is a good enough comparison.
gus_massa•8h ago
I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.
Total = 900 Wh = 3.24 MJ
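(Unit check: 900 Wh × 3600 s/h = 3.24 × 10^6 J = 3.24 MJ, i.e. roughly 100 W sustained over the two 4.5-hour exam sessions.)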
gus_massa•2h ago
If the computer uses ~600 W, let's give it 45+45 minutes and we are even :) If they want to use many GPUs ...
nicce•8h ago
So the real cost is something much more.
pfortuny•8h ago
https://cms.math.ca/publications/crux