Epoch confirms GPT-5.4 Pro solved a frontier math open problem

https://epoch.ai/frontiermath/open-problems/ramsey-hypergraphs

147•in-silico•2h ago

Comments

6thbit•1h ago

> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).

Interesting. What's that “scaffold”? A sort of unit test framework for proofs?

inkysigma•1h ago

I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.

I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.

readitalready•1h ago

Usually involves a lot of agents and their custom contexts or system prompts.

karmasimida•1h ago

There's no denying it at this point: AI can produce something novel, and it will be doing more of this going forward.

XCSme•45m ago

Not sure if AI can have clever or new ideas; it still seems to combine existing knowledge and execute algorithms.

I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.

nkozyra•31m ago

Clever/novel ideas are very often subtle deviations from known, existing work.

Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.

salomonk_mur•16m ago

There is no such thing. All new ideas are derived from previous experiences and concepts.

dotancohen•8m ago

We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.

osti•1h ago

Seems like the high-compute parallel thinking models weren't even needed; both the normal 5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.

renewiltord•1h ago

Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.

pinkmuffinere•53m ago

As someone with only passing exposure to serious math, this section was by far the most interesting to me:

> The author assessed the problem as follows.

> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]

How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.

ramblingrain•39m ago

Read about Paul Erdős... not all math is the Riemann Hypothesis; there is yeoman's work connecting things and whatever...

qnleigh•7m ago

For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.

For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).

an0malous•53m ago

I feel like there’s a fork in our future approaching where we’ll either blossom into a paradise for all or live under the thumb of like 5 immortal VCs

XCSme•44m ago

Change is always hard. Even if it will be good in 20 years, the transitions are always tough.

reverius42•24m ago

Sometimes the transition is tough and then the end state is also worse!

Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.

johnfn•52m ago

I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)

sublinear•48m ago

You might be joking, but you're probably also not that far off from reality.

I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems is opaque to most.

We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.

johnfn•44m ago

I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.

If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.

throw310822•31m ago

You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.

pinkmuffinere•29m ago

This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:

1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this", to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.

2. Verifying that the proposed solution actually is a full solution.

This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".

gpm•19m ago

Some thoughts.

1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.

2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.

3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.

Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.

typs•43m ago

I mean the details are in the post. You can see the conversation history and the mathematician survey on the problem

famouswaffles•26m ago

>The details about human involvement are always hazy and the significance of the problems is opaque to most.

Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.

alberth•49m ago

For those, like me, who find the prompt itself of interest …

> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].

[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...

[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...

tombert•47m ago

I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!

In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
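
For anyone unfamiliar with what feeding a proof into Lean means, here is a toy sketch in Lean 4. It has nothing to do with the hypergraph problem itself, and the theorem name `add_comm_example` is made up for illustration. The point is only that the checker accepts a statement when the supplied proof actually establishes it, which is what makes it usable as the verifier in a feedback loop.

```lean
-- Toy example: Lean accepts this only if the proof term really proves the statement.
-- `Nat.add_comm` is the core-library lemma for commutativity of addition on Nat.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the model proposed a wrong proof term here, Lean would reject it with an error, and that error is the kind of signal a loop could hand back to the model.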

daveguy•35m ago

New goalpost, and I promise I'm not being facetious at all, genuinely curious:

Can an AI pose a frontier math problem that is of any interest to mathematicians?

I would guess that 1) AI solving frontier math problems and 2) AI posing interesting/relevant math problems, together, would be an "oh shit" moment. Because that would be true PhD-level research.

vlinx•29m ago

This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.

measurablefunc•24m ago

I guess this means AI researchers should be out of jobs very soon.

Validark•17m ago

I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.

himata4113•7m ago

It's less solving a problem and more trying every single solution until one works. Exhaustive search, pretty much.

It's pretty much how all the hard problems are solved by AI from my experience.

lsc4719•5m ago

That's also the only way humans solve hard problems.

jMyles•3m ago

There have been both inductive and deductive solutions to open math problems by humans in the past decade, including to fairly high-profile problems.

adventured•4m ago

No, that's precisely solving a problem.

Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.

raincole•2m ago

In other words, it's solving a problem.
