Claude's Cycles: Claude Opus 4.6 solves a problem posed by Don Knuth [pdf]

https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf

119•fs123•5h ago

Comments

mccoyb•2h ago

It's fascinating to think about the space of problems which are amenable to RL scaling of these probability distributions.

Before, we didn't have a fast (we had to rely on human cognition) way to try problems - even if the techniques and workflows were known by someone. Now, we've baked these patterns into probability distributions - anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.

One question this raises to me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?

Crazy times.

Aerroon•1h ago

A bit related: open weights models are basically time capsules. These models have a knowledge cut off point and essentially forever live in that time.

bitexploder•52m ago

This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale. However, if you viewed them on some really large macro time scale where now LLMs are injecting information into the universe and the re-ingesting that maybe in some very philosophical way they are a /very/ slow oscillating intelligence right now. And as we narrow that gap (maybe with a totally new non-LLM paradigm) perhaps that is ultimately what gen AI becomes. Or some new insight that lets the models update themselves in some fundamental way without the insanely expensive training costs they have now.

anematode•42m ago

But they're not "slow"! Unlike biological thinking, which has a speed limit, you can accelerate these chains of thought by orders of magnitude.

Jweb_Guru•9m ago

I assure you that LLM thinking also has a speed limit.

mlyle•26m ago

There's nothing to say that you can't build something intelligent out of them by bolting a memory on it, though.

Sure, it's not how we work, but I can imagine a system where the LLM does a lot of heavy lifting and allows more expensive, smaller networks that train during inference and RAG systems to learn how to do new things and keep persistent state and plan.

charcircuit•12m ago

Memory is not just bolted on top of the latest models. They under go training on how and when to effectively use memory and how to use compaction to avoid running out of context when working on problems.

dtj1123•15m ago

Would you consider someone with anterograde amnesia not to be intelligent?

lxgr•1h ago

Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.

I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.

mccoyb•1h ago

Agreed, there's no doubt this will happen. It's likely already happening (it feels safe to assume that Anthropic is curating data from the data they record from Claude Code?)

As far as I understand RL scaling (we've already maxxed out RLVR), these machines only get better as long as they have expert reasoner traces available.

Having an expert work with an LLM and successfully solve a problem is high signal data, it may be the only path forward?

My prior is that these companies will take this data without asking you as much as they can.

lxgr•35m ago

Exactly, or functionally equivalently, asking you in paragraph 37 of a 120-page PDF (bonus points: in an agreement update).

And importantly, this can be cross-lab/model too. I suspect there's a reason why e.g. Google has been offering me free Claude inference in Google Antigravity on a free plan...

DeathArrow•1h ago

They can use LORA.

ainiriand•1h ago

Are not LLMs supposed to just find the most probable word that follows next like many people here have touted? How this can be explained under that pretense? Is this way of problem solving 'thinking'?

IgorPartola•1h ago

In some cases solving a problem is about restating the problem in a way that opens up a new path forward. “Why do planets move around the sun?” vs “What kind of force exists in the world that makes planets tethered to the sun with no visible leash?” (Obviously very simplified but I hope you can see what I am saying.) Given that a human is there to ask the right questions it isn’t just an LLM.

Further, some solutions are like running a maze. If you know all the wrong turns/next words to say and can just brute force the right ones you might find a solution like a mouse running through the maze not seeing the whole picture.

Whether this is thinking is more philosophical. To me this demonstrates more that we are closer to bio computers than an LLM is to having some sort of divine soul.

ainiriand•1h ago

Thanks for your input. The way I saw this and how it looks Knuth interpreted it is that there were some reasoning steps taken by Claude independently. Some internal decisions in the model that made it try different things, finally succeeding.

tux3•1h ago

>Are not LLMs supposed to just find the most probable word that follows next like many people here have touted?

The base models are trained to do this. If a web page contains a problem, and then the word "Answer: ", it is statistically very likely that what follows on that web page is an answer. If the base model wants to be good at predicting text, at some point learning the answer to common question becomes a good strategy, so that it can complete text that contains these.

NN training tries to push models to generalize instead of memorizing the training set, so this creates an incentive for the model to learn a computation pattern that can answer many questions, instead of just memorizing. Whether they actually generalize in practice... it depends. Sometimes you still get copy-pasted input that was clearly pulled verbatim from the training set.

But that's only base models. The actual production LLMs you chat with don't predict the most probable word according to the raw statistical distribution. They output the words that RLHF has rewarded them to output, which includes acting as an assistant that answers questions instead of just predicting text. RLHF is also the reason there are so many AI SIGNS [1] like "you're absolutely right" and way more use of the word "delve" than is common in western English.

[1]: https://en.wikipedia.org/wiki/WP:AISIGNS

esafak•1h ago

Are you feigning ignorance? The best way to answer a question, like completing a sentence, is through reasoning; an emergent behavior in complex models.

dilap•1h ago

That description is really only fair for base models†. Something like Opus 4.6 has all kinds of other training on top of that which teach it behaviors beyond "predict most probable token," like problem-solving and being a good chatbot.

(†And even then is kind of overly-dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting] your model will have to in fact get good at thinking like a mathematician. In general "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)

ericd•39m ago

I think it's pretty likely that "intelligence" is emergent behavior that comes when you predict what comes next in physical reality well enough, at varying timescales. Your brain has to build all sorts of world model abstractions to do that over any significant timescale. Big LLMs have to build internal world models, too, to do well at their task.

gpm•29m ago

I'd disagree, the other training on top doesn't alter the fundamental nature of the model that it's predicting the probabilities of the next token (and then there's a sampling step which can roughly be described as picking the most probable one).

It just changes the probability distribution that it is approximating.

To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...

vidarh•18m ago

Put a loop around an LLM and, it can be trivially made Turing complete, so it boils down to whether thinking requires exceeding the Turing computable, and we have no evidence to suggest that is even possible.

gpm•2m ago

What are you doing in your loop?

As typically deployed [1] LLMs are not turing complete. They're closer to linear bounded automaton, but because transformers have a strict maximum input size they're actually a subset of the weaker class of deterministic finite automaton. These aren't like python programs or something that can work on as much memory as you supply them, their architecture works on a fixed maximum amount of memory.

I'm not particularly convinced turing complete is the relevant property though. I'm rather convinced that I'm not turing complete either... my head is only so big after all.

[1] i.e. in a loop that appends output tokens to the input and has some form of sliding context window (perhaps with some inserted instructions to "compact" and then sliding the context window right to after those instructions once the LLM emits some special "done compacting" tokens).

[2] Common sampling procedures make them mildly non-deterministic, but I don't believe they do so in a way that changes the theoretical class of these machines from DFAs.

wrsh07•45m ago

Imagine training a chess bot to predict a valid sequence of moves or valid game using the standard algebraic notation for chess

Great! It will now correctly structure chess games, but we've created no incentive for it to create a game where white wins or to make the next move be "good"

Ok, so now you change the objective. Now let's say "we don't just want valid games, we want you to predict the next move that will help that color win"

And we train towards that objective and it starts picking better moves (note: the moves are still valid)

You might imagine more sophisticated ways to optimize picking good moves. You continue adjusting the objective function, you might train a pool of models all based off of the initial model and each of them gets a slightly different curriculum and then you have a tournament and pick the winningest model. Great!

Now you might have a skilled chess-playing-model.

It is no longer correct to say it just finds a valid chess program, because the objective function changed several times throughout this process.

This is exactly how you should think about LLMs except the ways the objective function has changed are significantly significantly more complicated than for our chess bot.

So to answer your first question: no, that is not what they do. That is a deep over simplification that was accurate for the first two generations of the models and sort of accurate for the "pretraining" step of modern llms (except not even that accurate, because pretraining does instill other objectives. Almost like swapping our first step "predict valid chess moves" with "predict stockfish outputs")

crocowhile•37m ago

Those people still exist? I only know one guy who is still fighting those windmills

qsera•32m ago

Yes, I am one.

qsera•29m ago

Yes, that is exactly what they do.

But that does not mean that the results cannot be dramatic. Just like stacking pixels can result in a beautiful image.

throw310822•11m ago

> just find the most probable word that follows next

Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.

This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique, there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predicting what intelligence would do, which is the same as being intelligent.

sega_sai•4m ago

In some sense that is still correct, i.e. the words are taken from some probability distribution conditional on previous words, but the key point is that probability distribution is not just some sort of average across the internet set of word probabilities. In the end this probability distribution is really the whole point of intelligence. And I think the LLMs are learning those.

miroljub•1h ago

Solves? It's a part of the training set. Nothing more, nothing less.

mwigdahl•1h ago

Did you read the article? It was an open problem.

bluGill•1h ago

Was it? It was an open problem to Knuth - who generally knows how to search literature. However there is enough literature to search that it wouldn't be a surprise at all to discover it was already solved but he just used slightly different terms and so didn't find it. Or maybe it was sovled because this is a specialization of something that looks unrelated and so he wouldn't have realized it when he read it. Or...

Overall I'm going with unsolved, because Knuth is a smart person who I'd expect to not miss the above. I'm also sure he falls for the above all the time even though the majority of the time he doesn't.

mwigdahl•53m ago

Agreed with all of that, but with the added point that Knuth has done a lot of work in this exact area in The Art of Computer Programming Volume 4. If he considers this conjecture open given his particular knowledge of the field, it likely is (although agreed, it's not guaranteed).

rpdillon•1h ago

Opening sentences:

> Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just been solved by Claude Opus 4.6— Anthropic’s hybrid reasoning model that had been released three weeks earlier! It seems that I’ll have to revise my opinions about “generative AI” one of these days. What a joy it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in automatic deduction and creative problem solving.

jcims•1h ago

Prove it.

romaniv•5m ago

I would like to note that would be trivial to definitively prove or dispriove such things if we had a searchable archive of the training data - which we do not and will not (by design). In fact, even searching the Internet for prior art is becoming increasingly ineffective, because most search engines have turned into recommendation and influence machines, which are happy to just feed you their own slop.

ecshafer•1h ago

I wonder how long we have until we start solving some truly hard problems with AI. How long until we throw AI at "connect general relativity and quantum physics", give the AI 6 months and a few data centers, and have it pop out a solution?

worldsavior•39m ago

If AGI will ever come, then. Currently, AI is only a statistical machines, and solutions like this are purely based on distribution and no logic/actual intelligence.

rustyhancock•34m ago

I don't even think that's the issue.

The issue to my mind is a lack of data at the meeting of QFT/GR.

Afterall few humans historically have been capable of the initial true leap between ontologies. But humans are pretty smart so we can't say that is a requirement for AGI.

worldsavior•22m ago

When it comes to revolutionary/unsolved subjects, there will never be enough data. That's why its revolutionary/unsolved.

zarzavat•5m ago

I swear that AI could independently develop a cure for cancer and people would still say that it's not actually intelligent, just matrix multiplications giving a statistically probable answer!

LLMs are at least designed to be intelligent. Our monkey brains have much less reason to be intelligent, since we only evolved to survive nature, not to understand it.

We are at this moment extremely deep into what most people would have been considered to be actual artificial intelligence a mere 15 years ago. We're not quite at human levels of intelligence, but it's close.

rustyhancock•38m ago

I think a very long time because part of our limit is experiment.

We need enough experimental results to explain to solve these theoretical mismatches and we don't and at present can't explore that frontier.

Once we have more results at that frontier we'd build a theory out from there that has two nearly independent limits for QFT and GR.

What we'd be asking if the AI is something that we can't expect a human to solve even with a lifetime of effort today.

It'll take something in par with Newton realising that the heavens and apples are under the same rules to do it. But at least Newton got to hold the apple and only had to imagine he could a star.

bob1029•12m ago

What prevents us from giving this system access to other real systems that live in physical labs? I don't see much difference between parameterizing and executing a particle accelerator run and invoking some SQL against a provider. It's just JSON on the wire at some level.

rustyhancock•7m ago

Nothing, we can give it all the data we have and have it lead experiments.

But we can not yet experiment at the GR/QFT frontier.

To do so with a particle accelerator it would need to be the size of the milky way.

fragmede•5m ago

The question is, if you trained an LLM on everything up until 1904, could it come up with E=MC² or not?

Pat44113•27m ago

I asked Claude to solve the pentominoes puzzle made famous by Arthur C. Clarke. It struggled mightily until I told it how I'd solved the problem using 64 bit unsigned integers to represent the board and pieces. Then, it created a C# program that solved the problem very quickly. However, in the 20x3 case it found four solutions when there are only two. Turns out it had incorrectly mapped one of the pentominoes. Sort of a silly mistake; the sort a human might make.

phoronixrly•24m ago

Probably because a human made it before their code got harvested and then regurgitated to you in the form of slop...

ontouchstart•22m ago

Fascinating report by DEK himself.

Time to sit down, read, digest and understand it without the help of LLM.

iandanforth•5m ago

TLDR (story, not math) - Knuth poses a problem, his friend uses Claude to conduct 30 some explorations, with careful human guidance, and Claude eventually writes a Python program that can find a solution for all odd values. Knuth then writes a proof of the approach and is very pleased by Claude's contribution. Even values remain an open question.

I'm reluctant to verify my identity or age for any online services

India's top court angry after junior judge cites fake AI-generated orders

The Xkcd thing, now interactive

Meta’s AI smart glasses and data privacy concerns

Claude's Cycles: Claude Opus 4.6 solves a problem posed by Don Knuth [pdf]

Apple Introduces MacBook Pro with All‑New M5 Pro and M5 Max

British Columbia is permanently adopting daylight time

Arm's Cortex X925: Reaching Desktop Performance

Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

I'm losing the SEO battle for my own open source project

Ars Technica fires reporter after AI controversy involving fabricated quotes

The Internet's Top Tech Publications Lost 58% of Their Google Traffic Since 2024

Apple introduces the new MacBook Air with M5

History of the Graphical User Interface: The Rise (and Fall?) Of WIMP Design

We Built a Video Rendering Engine by Lying to the Browser About What Time It Is

Simple screw counter

Computer Says No

Points on a ring: An interactive walkthrough of a popular math problem

Apple unveils new Studio Display and all-new Studio Display XDR

Don't Become an Engineering Manager

C64: Putting Sprite Multiplexing to Work

Disable Your SSH access accidentally with scp

Show HN: React-Kino – Cinematic scroll storytelling for React (1KB core)

Show HN: I built a sub-500ms latency voice agent from scratch

I built a pint-sized Macintosh

DOS Memory Management

Florida public universities to pause hiring new H-1B workers

Physicists developing a quantum computer that’s entirely open source

Mullvad VPN: Banned TV Ad in the Streets of London [video]

First in-utero stem cell therapy for fetal spina bifida repair is safe: study

I'm reluctant to verify my identity or age for any online services

India's top court angry after junior judge cites fake AI-generated orders

The Xkcd thing, now interactive

Meta’s AI smart glasses and data privacy concerns

Claude's Cycles: Claude Opus 4.6 solves a problem posed by Don Knuth [pdf]

Apple Introduces MacBook Pro with All‑New M5 Pro and M5 Max

British Columbia is permanently adopting daylight time

Arm's Cortex X925: Reaching Desktop Performance

Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

I'm losing the SEO battle for my own open source project

Ars Technica fires reporter after AI controversy involving fabricated quotes

The Internet's Top Tech Publications Lost 58% of Their Google Traffic Since 2024

Apple introduces the new MacBook Air with M5

History of the Graphical User Interface: The Rise (and Fall?) Of WIMP Design

We Built a Video Rendering Engine by Lying to the Browser About What Time It Is

Simple screw counter

Computer Says No

Points on a ring: An interactive walkthrough of a popular math problem

Apple unveils new Studio Display and all-new Studio Display XDR

Don't Become an Engineering Manager

C64: Putting Sprite Multiplexing to Work

Disable Your SSH access accidentally with scp

Show HN: React-Kino – Cinematic scroll storytelling for React (1KB core)

Show HN: I built a sub-500ms latency voice agent from scratch

I built a pint-sized Macintosh

DOS Memory Management

Florida public universities to pause hiring new H-1B workers

Physicists developing a quantum computer that’s entirely open source

Mullvad VPN: Banned TV Ad in the Streets of London [video]

First in-utero stem cell therapy for fetal spina bifida repair is safe: study

Claude's Cycles: Claude Opus 4.6 solves a problem posed by Don Knuth [pdf]

Comments