https://en.m.wikipedia.org/wiki/Betteridge's_law_of_headline...
Please stop, this is how you get AI takeovers.
It's pretty easy: causal reasoning. Causal, not mere statistical correlation as LLMs do, with or without "CoT".
If you mean deterministic rather than probabilistic, even Pearl-style causal models are probabilistic.
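For a concrete illustration (a toy sketch, not anything from the thread): in a structural causal model every mechanism is still a conditional probability, and an intervention just replaces one of those mechanisms.

    import random

    # Toy structural causal model: rain -> sprinkler -> wet grass.
    # Each mechanism below is a conditional probability, so the model is causal AND probabilistic.

    def sample(do_sprinkler=None):
        rain = random.random() < 0.3
        # do(sprinkler := x) cuts the arrow from rain; plain observation keeps it.
        if do_sprinkler is None:
            sprinkler = random.random() < (0.1 if rain else 0.5)
        else:
            sprinkler = do_sprinkler
        wet = random.random() < (0.95 if (rain or sprinkler) else 0.05)
        return rain, sprinkler, wet

    # Estimate P(wet | do(sprinkler=True)) by Monte Carlo -- still a probability, not a certainty.
    n = 100_000
    print(sum(sample(do_sprinkler=True)[2] for _ in range(n)) / n)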
I think the author is circling around the idea that their idea of reasoning is to produce statements in a formal system: to have a set of axioms, a set of production rules, and to generate new strings/sentences/theorems using those rules. This approach is how math is formalized. It allows us to extrapolate - make new "theorems" or constructions that weren't in the "training set".
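As a toy sketch of that picture (Hofstadter's MIU system, chosen here only as an example of axioms plus production rules, not something from the article):

    # One axiom plus four production rules; "theorems" are whatever strings the rules can derive.
    AXIOMS = {"MI"}

    def productions(s):
        """Yield every string derivable from s by a single rule application."""
        if s.endswith("I"):                 # Rule 1: xI -> xIU
            yield s + "U"
        if s.startswith("M"):               # Rule 2: Mx -> Mxx
            yield "M" + s[1:] * 2
        for i in range(len(s) - 2):         # Rule 3: III -> U
            if s[i:i + 3] == "III":
                yield s[:i] + "U" + s[i + 3:]
        for i in range(len(s) - 1):         # Rule 4: UU -> (nothing)
            if s[i:i + 2] == "UU":
                yield s[:i] + s[i + 2:]

    def theorems(max_steps=3):
        """New strings not in the starting set, generated purely by rule application."""
        known = set(AXIOMS)
        frontier = set(AXIOMS)
        for _ in range(max_steps):
            frontier = {t for s in frontier for t in productions(s)} - known
            known |= frontier
        return known - AXIOMS

    print(sorted(theorems(), key=len))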
You need to actually have something that deduces a result from a set of principles that form a logical conclusion, or the understanding that more data is needed to make a conclusion. That is clearly different from finding a likely next token on statistics alone, despite the fact that the statistical answer can be correct.
LLMs are not doing causal reasoning because there are no facts, only tokens. For the most part you can't ask an LLM how it came to an answer, because it doesn't know.
Or even a causal-reasoning tool for an LLM agent, working the way it already does for math, where it forwards the request to Wolfram.
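Roughly this kind of routing (every name below is a hypothetical placeholder, not any real framework's API): the LLM only picks which backend handles the question; the causal or math engine does the actual work.

    # Hypothetical sketch of tool routing -- all function names are made up for illustration.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError   # stand-in for whatever model API you use

    def wolfram_solve(query: str) -> str:
        raise NotImplementedError   # stand-in for a symbolic math backend

    def causal_engine(query: str) -> str:
        raise NotImplementedError   # stand-in for a do-calculus / causal-inference backend

    TOOLS = {"math": wolfram_solve, "causal": causal_engine}

    def answer(user_query: str) -> str:
        choice = call_llm(f"Pick one of {sorted(TOOLS)} (or 'none') for: {user_query}").strip()
        handler = TOOLS.get(choice, call_llm)   # fall back to the plain LLM
        return handler(user_query)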
Exponential time complexity.
You have missed the foundation: before dynamics, being. Before causal reasoning you have deep definition of concepts. Causality is "below" that.
Reasoning, thinking, knowing, feeling, understanding, etc.
Or at the very least, our rubrics and heuristics for determining if someone (thing) thinks, feels, knows, etc, no longer work. And in particular, people create tests for those things thinking that they understand what they are testing for, when _most human beings_ would also fail those tests.
I think a _lot_ of really foundational work needs to be done on clearly defining a lot of these terms and putting them on a sounder basis before we can really move forward on saying whether machines can do those things.
Animals do not have spoken language the way humans do, so their thoughts aren’t really composed of sentences. Yet, they have intelligence and can reason about their world.
How could we build an AGI that doesn’t use language to think at all? We have no fucking clue and won’t for a while because everyone is chasing the mirage created by LLMs. AI winter will come and we’ll sit around waiting for the next big innovation. Probably some universal GOAP with deeply recurrent neural nets.
We built a box that spits out natural language and tricks humans into believing it's conscious. The box itself actually isn't that interesting, but the human side of the equation is.
You have only proven the urgency of Intelligence, the need to produce it in inflationary amounts.
I would like to reassure you that we - we here - see LLMs as very much unlike us.
And why should you not exclude them? Where does this idea come from, taking random elements as models? Where do you see pedestals of free access? Is the Nobel Prize a raffle now?
I think this is the most important critique that undercuts the paper's claims. I'm less convinced by the other point. I think backtracking and/or parallel search is something future papers should definitely look at in smaller models.
The article is definitely also correct on the overreaching, broad philosophical claims that seem common when discussing AI and reasoning.
Reducing the distance of each statistical leap improves “performance” since you would avoid failure modes that are specific to the largest statistical leaps, but it doesn’t change the underlying mechanism. Reasoning models still “hallucinate” spectacularly even with “shorter” gaps.
If I ask you what's 2+2, there's a single answer I consider much more likely than others.
Sometimes, words are likely because they are grounded in ideas and facts they represent.
Put another way, LLMs are good at talking like they are thinking. That can get you pretty far, but it is not reasoning.
It's true that if it's not producing text, there is no thinking involved, but it is absolutely NOT clear that the attention block isn't holding state and modeling something as it works to produce text predictions. In fact, I can't think of a way to define it that would make that untrue... unless you mean that there isn't a system wherein something like attention is updating/computing and the model itself chooses when to make text predictions. That's by design, but what you're arguing doesn't really follow.
Now, whether what the model is thinking about inside that attention block matches up exactly or completely with the text it's producing as generated context is probably at least a little dubious, and it's unlikely to be a complete representation regardless.
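For concreteness, a stripped-down single-head attention loop (numpy toy, no claim it matches any particular model): the cached keys/values from earlier tokens are exactly the "state" that conditions every later prediction.

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    K_cache, V_cache = [], []        # this growing cache is the "state" being held

    def step(x):                     # x: embedding of the newest token, shape (d,)
        q = x @ Wq
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        K, V = np.stack(K_cache), np.stack(V_cache)
        w = np.exp(q @ K.T / np.sqrt(d))
        w /= w.sum()
        return w @ V                 # every output mixes in all previously cached state

    for _ in range(5):
        out = step(rng.normal(size=d))
    print(out.shape)                 # (8,) -- each step depends on everything cached so far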
That we implement skills, not deficiencies, is a basic concept that now needs so much visibility that it should probably be written into the guidelines.
We implement skills, not deficiencies.
This was the view of Hume (humans as bundles of experience who just collect information and make educated guesses for everything). Unfortunately, it leads to philosophical skepticism, in which you can't ground any knowledge absolutely, as it's all just justified by some knowledge you got from someone else, which also came from someone else, etc., and eventually you can't actually justify any knowledge that isn't directly a result of experience (the concept of "every effect has a cause" is a classic example).
There have been plenty of epistemological responses to this viewpoint, with Kant's view, of humans doing a mix of "gathering context" (using our senses) but also applying universal categorical reasoning to schematize and understand / reason from the objects we sense, being the most well known.
I feel like anyone talking about the epistemology of AI should spend some time reading the basics of all of the thought from the greatest thinkers on the subject in history...
People who deeply study only technology have just that frame of reference for viewing the world, so they make the mistake of assuming everything must work that way, including humans.
If they had a wider frame of reference that included, for example, Early Childhood Development, they might have enough knowledge to think outside of this box and know just how ridiculous that argument is.
Some of how they work is well understood (a lot now, actually), some of the outcomes are still surprising.
But we debate both the well-understood parts and the surprising parts with the wrong terminology, borrowed from pretty dubious corners of pop cognitive science, and not with terminology appropriate to the new and different thing! It's nothing like a brain, it's a new, different thing. Does it think or reason? Who knows, pass the blunt.
They do X performance on Y task according to Z eval; that's how you discuss ML model capability if you're pursuing understanding rather than fundraising or clicks.
(with a curious parallel about whether some paths in thought are dead-ends - the unproductive focus mentioned in the article).
With thinking or reasoning, there's not really a precise definition of what it is, but we nevertheless know that currently LLMs and machines more generally can't reproduce many of the human behaviours that we refer to as thinking.
The question of what tasks machines can currently accomplish is certainly meaningful, if not urgent, and the reason LLMs are getting so much attention now is that they're accomplishing tasks that machines previously couldn't do.
To some extent there might always remain a question about whether we call what the machine is doing "thinking" - but that's the uninteresting verbal question. To get at the meaningful questions we might need a more precise or higher resolution map of what we mean by thinking, but the crucial element is what functions a machine can perform, what tasks it can accomplish, and whether we call that "thinking" or not doesn't seem important.
Maybe that was even Dijkstra's point, but it's hard to tell without context...
It would be interesting to see if this study’s results can be reproduced in a more realistic setting.
The author has a curious idea of what "reasoning" entails.
Whether it's a mirage or not, the ability to produce a symbolically logical result that has valuable meaning seems real enough to me.
Especially since most meaning is assigned by humans onto the world... so too can we choose to assign meaning (or not) to the output of a chain of symbolic logic processing?
Edit: maybe it is not so much that an LLM calculates/evaluates the result of symbolic logic as it is that it "follows" the pattern of logic encoded into the model.
> I appreciate that research has to be done on small models, but we know that reasoning is an emergent capability! (...) Even if you grant that what they’re measuring is reasoning, I am profoundly unconvinced that their results will generalize to a 1B, 10B or 100B model.
A fundamental part of applied research is simplifying a real-world phenomenon to better understand it. Dismissing the finding that, at this parameter count and on such a simple problem, the LLM can't perform out of distribution, merely because the model isn't big enough, undermines the very value of independent research. Tomorrow another model with double the parameters may or may not show the same behavior, but that finding will be built on top of this one.
Also, how do _you_ know that reasoning is emergent, and not rationalising on top of a compressed version of the web stored in 100B parameters?
That is an unreasonable assumption. In the case of LLMs it seems wasteful to transform a point in latent space into a single token and lose information. In fact, I think in the near future it will be the norm for MLLMs to "think" and "reason" without outputting a single "word".
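A crude numpy caricature of the difference (entirely hypothetical, just to make the lost-information point): the token loop collapses the state vector into one discrete symbol each step, while a latent loop would carry the whole vector forward.

    import numpy as np

    rng = np.random.default_rng(1)
    V, d = 50, 16                                    # toy vocab size and hidden width
    W_step = rng.normal(size=(d, d)) / np.sqrt(d)    # stand-in for the model's state update
    W_out  = rng.normal(size=(d, V)) / np.sqrt(d)    # unembedding
    E      = rng.normal(size=(V, d))                 # token embeddings

    def step_with_tokens(h):
        tok = int(np.argmax(h @ W_out))   # collapse the state into one discrete symbol
        return np.tanh(E[tok] @ W_step)   # only that symbol's embedding carries forward

    def step_in_latent_space(h):
        return np.tanh(h @ W_step)        # keep the full vector; emit no word at all

    h = rng.normal(size=d)
    for _ in range(10):
        h = step_in_latent_space(h)
    print(np.round(h[:4], 3))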
> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.
It is not a "philosophical" (by which the author probably meant "practically inconsequential") question. If the whole reasoning business is just rationalization of pre-computed answers, or simply a way to get extra computation in because each token provides only a fixed amount of computation to update the model's state, then it doesn't make much sense to focus on improving the quality of the chain-of-thought output from a human point of view.
That said, this author says this question of whether models "can reason" is the least interesting thing to ask. But I think the least interesting thing you can do is to go around taking every complaint about LLM performance and saying "but humans do the exact same thing!" Which is often not true, but again, doesn't matter.
stonemetal12•1h ago
How do you see that impacting the results? It is the same algorithm just on a smaller scale. I would assume a 4 layer model would not be very good, but does reasoning improve it? Is there a reason scale would impact the use of reasoning?
azrazalea_debt•1h ago
If (BIG if) we ever do see actual AGI, it is likely to work like this. It's unlikely we're going to make AGI by designing some grand Cathedral of perfect software, it is more likely we are going to find the right simple principles to scale big enough to have AGI emerge. This is similar.
NitpickLawyer•1h ago
A depth of 4 is very small. It is very much a toy model. It's ok to research this, and maybe someone will try it out on larger models, but it's totally not ok to lead with the conclusion, based on this toy model, IMO.