Erdős Problem #1026

https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/

161•tzury•1mo ago

Comments

tzury•1mo ago

This case study reveals the future of AI-assisted[1] work, far beyond mathematics.

It relies on a combination of Humans, LLMs ('General Tools'), Domain-Specific Tools, and Deep Research.

It is apparent that the static data encoded within an LLM is not enough; one must re-fetch sources and digest them fresh for the context of the conversation.

In this workflow, AlphaEvolve, Aristotle, and LEAN are the 'PhDs' on the team, while the LLM is the Full Stack Developer that glues them all together.

[1] If one likes pompous terms, this is what 'AGI' will actually look like.

9u127v89yik2j3•1mo ago

The author is the PhD on the team.

Literally not AGI.

philipwhiuk•1mo ago

Aristotle is already an LLM and LEAN combined.

[from the Aristotle paper]

> Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver.

[from elsewhere on how part 2 works]

> To address IMO-level complexity, Aristotle employs a natural language module that decomposes hard problems into lists of informally reasoned lemmas. This module elicits high-level proof sketches and supporting claims, then autoformalizes them into Lean for formal proving. The pipeline features iterative error feedback: Lean verification errors are parsed and fed back to revise both informal and formal statements, iteratively improving the formalization and capturing creative auxiliary definitions often characteristic of IMO solutions.

baq•1mo ago

I have no comments about the result itself, but the process and the AI policy which facilitated it is inspiring and easily transferable to any moderately complicated software engineering problem. Much to learn regardless of the maths.

nsoonhui•1mo ago

But software engineering problems are more fuzzy and less amendable to mathematical analysis, so exactly how can those AI policies developed for math be applied to software engineering problems?

boerseth•1mo ago

Not sure which way the difference puts the pressure. Does the fuzziness require more prudent policies, or allow us to get away with less?

contravariant•1mo ago

Don't use them for the parts that are fuzzy.

I mean it should be obvious that making executive decisions about what the code should do exactly should only be left to a RNG powered model if the choices made are unimportant.

robrenaud•1mo ago

I think you underestimate how powerful lean is, and close it is to the tedious part of formal math. A theorem prover needs consult no outside resource. A formal math LLM-like generator need only consult the theorem prover to get rid of hallucinations. This is why it's actually much easier than SWE to optimize/hill climb on.

Low level, automated theorem providing is going to fall way quicker than most expected, like AlphaGo, precisely because an MCTS++ search over lean proofs is scalable/amendable to self play/relevant to a significant chunk of professional math.

Legit, I almost wish the US and China would sign a Formal Mathematics Profileration Treaty, as a sign of good will between very powerful parties who have much to gain from each other. When your theorem prover is sufficiently better than most Fields medalists alive, you share your arch/algorithms/process with the world. So Mathematics stays in the shared realm of human culture, and it doesn't just happen to belong to DeepMind, OpenAI, or Deepseek.

baq•1mo ago

On the contrary I think we're low key on the verge of model checkers being widely deployed in the industry. I've been experimenting with Opus 4.5 + Alloy and the preliminary results I'm getting are crossing usability thresholds in a step-function pattern (not surprising IMHO), I just haven't seen anyone pick up on it publicly yet.

The workflow I'm envisioning here is the plan document we're all making nowadays isn't being translated directly into code, but into a TLA+/Alloy/... model as executable docs and only then lowered into the code space while conformance is continuously monitored (which is where the toil makes it not worth it most of the time without LLMs). The AI literature search for similar problems and solutions is also obviously helpful during all phases of the sweng process.

robot-wrangler•1mo ago

> The workflow I'm envisioning here is the plan document we're all making nowadays isn't being translated directly into code, but into a TLA+/Alloy/... model as executable docs and only then lowered into the code space while conformance is continuously monitored

I'm sure we've agreed on this before, but I agree again ;) There are dozens of us at least, dozens! There's also a recent uptick in posts with related ideas, for example this hit the front-page briefly ( https://news.ycombinator.com/item?id=46251667 ).

I was tempted to start with alloy/tla for my own experiments along these lines due to their popularity, but since the available training data is so minimal for everything in the space.. I went with something more obscure (MCMAS) just for access to "agents" as primitives in the model-checker.

baq•1mo ago

> available training data is so minimal for everything in the space

Haven't tried anything other than Alloy yet, but I've got a feeling Anthropic has employed some dark arts to synthesize either Alloy models or something closely related and trained Opus on the result - e.g. GPT 5.1 commits basic syntax errors, while Opus writes models like it's just another day at the office.

gaigalas•1mo ago

Is it trivial for any mathematician to understand lean code?

I'm curious if there is a scenario in which a large automated proof is achieved but there would be no practical means of getting any understanding of what it means.

I'm an engineer. Think like this: a large complex program that compiles but you don't understand what it does or how to use it. Is such a thing possible?

JulianWasTaken•1mo ago

It's not trivial for a mathematician to understand Lean code, but it's something that's possible to learn to read and interpret in a day (without then necessarily being proficient in how to write it).

That's true though of Lean code written by a human mathematician.

AI systems are capable (and generally even predisposed to) producing long and roundabout proofs which are a slog to decipher. So yes the feeling is somewhat similar at times to an LLM giving you a large and sometimes even redundant-in-parts program.

gus_massa•1mo ago

With very difficult human generated proof, it's common that it take like 10 or 20 to make it understandable for mortals. The idea is to split the proof, create new notation, add intermedite steps that are nice, find a simpler path. It's like refactoring.

Sometimes the original proof is compleyely replaced, bit by bit, until there is an easy to understand version.

gus_massa•1mo ago

Too late to edit:

"10 or 20" -> "10 or 20 years"

gaigalas•1mo ago

Wow!

If curl developers are overwhelmed by AI PRs, imagine how mathematicians will feel verifying a huge backlog of automated proofs.

Or isn't there such a thing? Can someone make a very complicated automated proof that ultimately reveals itself to be useless?

gus_massa•1mo ago

For math it's easy, everyone just ignore it. There is no Daniel to blame. There are a few post about P=/!=NP or the Riemann conjeture or rewriting physics each week that are posted to HN. I'm just ignoring them. Other mathematicians are just ignoring them. But you will not find anyone to blame.

There are a few "solutions" of conjetures that may be correct, like https://en.wikipedia.org/wiki/Abc_conjecture I'm not sure about the current state. There may be a few mathematicians trying to read some parts, or perhaps no. Perhaps in a few years the easy parts will be refactored and isolated, and published as special cases. And after a while, it may be verified or someone will find a gap and perhaps fix it. Just wait a few decades.

> Can someone make a very complicated automated proof that ultimately reveals itself to be useless?

It depends, on what you consider insightful. Take a look at "Determination of the fifth Busy Beaver value" https://news.ycombinator.com/item?id=45273999 in particular the first comment. Is that an heroic result that opens a lot of paths or a useless combination of tricks that no one will ever understand? (In my opinion a proof is a proof [standing applause emoji].)

gaigalas•1mo ago

What I (personally) consider insightful is irrelevant. It's about what mathematicians consider insightful.

Mathematicians are obviously not ignoring automated proofs. Terry's post is an evidence of that.

Consider LK99 instead of crackpot P vs NP proofs. That wasted a lot of academia time.

It seems that it could happen to math.

QuesnayJr•1mo ago

There's a couple of problems that were solved that way a while ago, and they have been formalized, but not in Lean:

https://en.wikipedia.org/wiki/Four_color_theorem

https://en.wikipedia.org/wiki/Kepler_conjecture

voxl•1mo ago

Yes.. yes.. sure, of course... You neglect this one little detail: theorem proving IS programming. So if an AI can be "better than a fields medalist" (a laughable claim akin to basically calling it AGI) then it will be better than all software engineers everywhere.

But see you neglect something important: it's the programmer that is establishing the rules of the game, and as Grothendieck taught us already, often just setting up the game is ALL of the work, and the proof is trivial.

robrenaud•1mo ago

What is harder, beating Lee Sedol at Go, or physically placing stones on a Go board? Which is closer to AGI?

Because AlphaGo can only do one.

AI could very well be better at formal theorem proving than fields medalists pretty soon. It will not have taste, ability to see the beauty in math, or pick problems and set directions for the field. But given a well specified problem, it can bruteforce search through lean tactics space at an extremely superhuman pace. What is lacks in intuition and brilliance, it will make up in being able to explore in parallel.

There is a quality/quantity tradeoff in search with a verifier. A superhuman artificial theorem prover can be generating much worse ideas on average than a top mathematician, and make up for it by trying many more of them.

It's Kasparov vs DeepBlue and Sedol vs AlphaGo all over.

It's also nowhere near AGI. Embodiment and the real world is super messy. See Moravec's paradox.

Practical programs deal with the outside world, they are underspecified, their utility depends on the changing whims of people. The formal specification of a math problem is self contained.

voxl•1mo ago

Your analogy completely misses the point. What is harder? Designing a game that has relevant implications in physical reality, or playing a game already designed given an oracle that tells you when you make an incorrect move?

riku_iki•1mo ago

> But given a well specified problem, it can bruteforce search through lean tactics space at an extremely superhuman pace. What is lacks in intuition and brilliance, it will make up in being able to explore in parallel.

right, but because math is not well specified and formalized yet, it could be a problem, and that's where humans with intuition and more rigid reasoning still necessary.

clbrmbr•1mo ago

> given a sequence of {k^2+1} distinct real numbers, one can find a subsequence of length {k+1} which is either increasing or decreasing

{-2, 1, -1, 1/2, -1/2, 1/3, -1/3, 1/4, … -1/(k/2)} is a sequence of {k^2+1} distinct real numbers, but the longest increasing or decreasing subsequences are of length 2, not k+1.

What am I missing?

dmehrle•1mo ago

Subsequences need not be contiguous. In your example, taking every other number gives the desired monotone subsequence.

andrepd•1mo ago

Non-consecutive.

seanhunter•1mo ago

The definition of a subsequence is if you have a(n) as a sequence of real numbers and n_1 < n_2 <n_3 < ... is an increasing sequence of integers then

a(n_1), a(n_2), a(n_3), ... is a subsequence of a_n and is denoted a(n_k).

So the indexes don't need to be contiguous, just increasing.

So in your example 2, 1, 1/2, 1/3, ... is a decreasing subsequence.

edit: changed to using function-style notation because the nested subscript notation looks confusing in ascii

clbrmbr•1mo ago

Thanks. I was thinking subsequence ~ substring but that’s a false analogy apparently!

seanhunter•1mo ago

Yeah it’s a bit confusing. It’s also confusing when you see them written because they’re actually written usually with a nested subscript. Like

   a
    n
     k

With the k smaller than the n which is in turn smaller than the a. Sequences of all kinds are just a function from the integers to the reals so I don’t know why we had to go and invent a whole new notation for them just to be extra obtuse.

jebarker•1mo ago

> Within an hour, Koishi Chan gave an alternate proof deriving the required bound {c(k^2) \geq 1/k} from the original Erdős-Szekeres theorem by a standard “blow-up” argument which we can give here in the Alice-Bob formulation.

Is this an example of the 4 minute mile phenomenon or did the AI proof provide key insights that Chan was able to use in their proof?

Nintendo Wii Themed Portfolio

"There must be something like the opposite of suicide "

Ask HN: Why doesn't Netflix add a “Theater Mode” that recreates the worst parts?

Show HN: Engineering Perception with Combinatorial Memetics

Show HN: Steam Daily – A Wordle-like daily puzzle game for Steam fans

The Anthropic Hive Mind

Just Started Using AmpCode

LLM as an Engineer vs. a Founder?

Crosstalk inside cells helps pathogens evade drugs, study finds

Show HN: Design system generator (mood to CSS in <1 second)

Show HN: 26/02/26 – 5 songs in a day

Toroidal Logit Bias – Reduce LLM hallucinations 40% with no fine-tuning

Top AI models fail at >96% of tasks

The Science of the Perfect Second (2023)

Bob Beck (OpenBSD) on why vi should stay vi (2006)

Show HN: a glimpse into the future of eye tracking for multi-agent use

The Optima-l Situation: A deep dive into the classic humanist sans-serif

Barn Owls Know When to Wait

Implementing TCP Echo Server in Rust [video]

LicGen – Offline License Generator (CLI and Web UI)

Service Degradation in West US Region

The Janitor on Mars

Bringing Polars to .NET

Adventures in Guix Packaging

Show HN: We had 20 Claude terminals open, so we built Orcha

Your Best Thinking Is Wasted on the Wrong Decisions

Warcraftcn/UI – UI component library inspired by classic Warcraft III aesthetics

Trump Vodka Becomes Available for Pre-Orders

Velocity of Money

Stop building automations. Start running your business

Nintendo Wii Themed Portfolio

"There must be something like the opposite of suicide "

Ask HN: Why doesn't Netflix add a “Theater Mode” that recreates the worst parts?

Show HN: Engineering Perception with Combinatorial Memetics

Show HN: Steam Daily – A Wordle-like daily puzzle game for Steam fans

The Anthropic Hive Mind

Just Started Using AmpCode

LLM as an Engineer vs. a Founder?

Crosstalk inside cells helps pathogens evade drugs, study finds

Show HN: Design system generator (mood to CSS in <1 second)

Show HN: 26/02/26 – 5 songs in a day

Toroidal Logit Bias – Reduce LLM hallucinations 40% with no fine-tuning

Top AI models fail at >96% of tasks

The Science of the Perfect Second (2023)

Bob Beck (OpenBSD) on why vi should stay vi (2006)

Show HN: a glimpse into the future of eye tracking for multi-agent use

The Optima-l Situation: A deep dive into the classic humanist sans-serif

Barn Owls Know When to Wait

Implementing TCP Echo Server in Rust [video]

LicGen – Offline License Generator (CLI and Web UI)

Service Degradation in West US Region

The Janitor on Mars

Bringing Polars to .NET

Adventures in Guix Packaging

Show HN: We had 20 Claude terminals open, so we built Orcha

Your Best Thinking Is Wasted on the Wrong Decisions

Warcraftcn/UI – UI component library inspired by classic Warcraft III aesthetics

Trump Vodka Becomes Available for Pre-Orders

Velocity of Money

Stop building automations. Start running your business

Erdős Problem #1026

Comments