The Illusion of Thinking: Understanding the Limitations of Reasoning LLMs [pdf]

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

123•amrrs•9h ago

Comments

behnamoh•5h ago

Okay Apple, you got my attention. But I'm a strong proponent of the "something is better than nothing" philosophy - even if OpenAI/Google/etc. are building reasoning models with the limitations that you describe, they are still huge progress compared to what we had not long ago. Meanwhile you're not even trying.

It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".

suddenlybananas•5h ago

I think you're mistaking the work of researchers who work at Apple with the particular investment decisions of Apple over the past few years.

>It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".

This is a patently absurd thing to write about a research paper.

bwfan123•3h ago

there is enough hype already - with AGI being promised as imminent.

this work balances the hype and shows fundamental limitations so the AI hypesters are checked.

why be salty ?

ivape•5h ago

This is easily explained by accepting that there is no such thing as LRMs. LRMs are just LLMs that iterate on their own answers more (or provide themselves with more context of a certain type). The reasoning loop on an "LRM" will be equivalent to asking a regular LLM to "refine" its own response, or "consider" additional context of a certain type. There is basically no such thing as reasoning, as it was always a method to "fix" hallucinations or provide more context automatically, nothing else. These big companies baked in one of the hackiest prompt engineering tricks that your typical enthusiast figured out long ago, and managed to brand it and profit off it. The craziest part about this was that DeepSeek was able to cause a multi-billion-dollar drop and pump of AI stocks with this one trick. Crazy times.

AlienRobot•3h ago

Is that what "reasoning" means? That sounds pretty ridiculous.

I've thought before that AI is as "intelligent" as your smartphone is "smart," but I didn't think "reasoning" would be just another buzzword.

ngneer•4m ago

I am not too familiar with the latest hype, but "reasoning" has a very straightforward definition in my mind. For example, can the program in question derive new facts from old ones in a logically sound manner. Things like applying modus ponens. (A and A => B) => B. Or, all men are mortal and Socrates is a man, and therefore Socrates is mortal. If the program cannot deduce new facts, then it is not reasoning, at least not by my definition.
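To make that definition concrete, here is a minimal sketch of deriving new facts by repeatedly applying modus ponens (forward chaining). The function name `forward_chain` and the string encoding of facts are assumptions made for illustration, not anything taken from the paper.

```python
def forward_chain(facts, rules):
    """Apply modus ponens until no new facts can be derived.

    facts: set of propositions known to be true.
    rules: list of (premises, conclusion) pairs, read as
           "if every premise holds, then the conclusion holds".
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Modus ponens: premises are all known, so add the conclusion.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# "All men are mortal" instantiated for Socrates, plus "Socrates is a man".
facts = {"man(Socrates)"}
rules = [(("man(Socrates)",), "mortal(Socrates)")]
derived = forward_chain(facts, rules)
print(sorted(derived))
```

A program like this is sound by construction, but it only knows the rules it is given, which is roughly the opposite trade-off from an LLM.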

meroes•3h ago

Yep. This is exactly the conclusion I reached as an RLHF'er. Reasoning/LRM/SxS/CoT is "just" more context. There never was reasoning. But of course, more context can be good.

JusticeJuice•4h ago

Their finding of LLMs working best at simple tasks, LRMs working best at medium complexity tasks, and then neither succeeding at actually complex tasks is good to know.

cubefox•2h ago

Not sure whether I sense sarcasm.

nialv7•4h ago

I've seen this too often, papers that ask questions they don't even bother to properly define.

> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?

Define reasoning, define generalizable, define pattern matching.

For additional credits after you have done so, show humans are capable of what you just defined as generalizable reasoning.

beneboy•3h ago

This kind of explains why Claude will find the right solution, but then the more it thinks and keeps “improving” the more over-engineered (and sometimes wrong) the solution is. Interesting to see this coming up in formal research.

bicepjai•3h ago

The study challenges the assumption that more “thinking” or longer reasoning traces necessarily lead to better problem-solving in LRMs

bayindirh•3h ago

As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to decode a single BASE64 string.

Flash answered correctly in ~2 seconds, at most. Pro answered very wrongly after thinking and elaborating for ~5 minutes.

Flash was also giving a wrong answer for the same string in the past, but it improved.

Prompt was the same: "Hey, can you decode $BASE64_string?"

I have no further comments.
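For comparison, decoding Base64 deterministically takes a few lines with Python's standard library. This is only a sketch: the sample string is made up here, since the string from the test above was not shared, and `decode_base64_text` is a hypothetical helper name.

```python
import base64


def decode_base64_text(encoded: str) -> str:
    """Decode a Base64 string and interpret the bytes as UTF-8 text."""
    # validate=True rejects characters outside the Base64 alphabet
    # instead of silently skipping them.
    raw = base64.b64decode(encoded, validate=True)
    return raw.decode("utf-8")


# Made-up sample input: encode a known string, then decode it again.
sample = base64.b64encode("The Illusion of Thinking".encode("utf-8")).decode("ascii")
print(sample)
print(decode_base64_text(sample))
```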

actinium226•3h ago

Man, remember when everyone was like 'AGI just around the corner!' Funny how well the Gartner hype cycle captures these sorts of things

bayindirh•3h ago

They're similar to self-driving vehicles. Both are around the corner, but neither can negotiate the turn.

einrealist•3h ago

All that to keep the investment pyramid schemes going.

yahoozoo•3h ago

We will be treating LLMs “like a junior developer” forever.

JKCalhoun•3h ago

And I'm fine with that.

sneak•1h ago

Even if they never get better than they are today (unlikely) they are still the biggest change in software development and the software development industry in my 28 year career.

tonyhart7•3h ago

I think we're at around 80% of the progress

the easy part is done but the hard part is so hard it takes years to progress

roenxi•2h ago

What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so. Papers like this are the academics mapping out where the engineering efforts need to be directed to get there, and it seems to be a relatively small number of challenges that are easier than the ones already overcome - we know machine learning can solve Towers of Hanoi, for example. It isn't fundamentally complicated like Baduk is. The next wall to overcome is more of a low fence.

Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
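For reference, the classic recursive Tower of Hanoi solution is short, but its output is not: an n-disk puzzle needs 2^n - 1 moves, so the length of a correct answer grows exponentially with the number of disks. A minimal sketch follows; the peg names and the `is_valid_solution` checker are assumptions for illustration, not the paper's evaluation code.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves that solves n disks."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest, then restack.
    return (
        hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi(n - 1, spare, target, source)
    )


def is_valid_solution(n, moves):
    """Replay the moves and check the rules and the final state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


for n in (3, 5, 10):
    moves = hanoi(n)
    print(n, len(moves), is_valid_solution(n, moves))
```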

MoonGhost•1h ago

> What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so

It's better today. Hoping that LLMs could get us to AGI in one hop was naive. Depending on the definition of AGI, we might already be there. But for superhuman level in all possible tasks there are many steps still to be done. The obvious way is to find a solution for each type of task. We already have one for math calculations: using tools. Many other types can be solved the same way. After a while we'll gradually get to a well-rounded 'brain', or model(s) + support tools.

So far the future looks bright: there is progress and there are problems, but no deadlocks.

PS: Turing test is a <beep> nobody seriously talks about today.

latchup•2h ago

To be fair, a technology sigmoid curve rises fastest right at its inflection point, so from inside the steep part it is hard to predict when innovation will start to slow down.

The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you had told people back then that not much would fundamentally change in civil aviation over the next 57 years, no one would have believed you.
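On the sigmoid point: for a logistic curve the growth rate peaks exactly at the inflection point, so the steepest stretch is also the moment the slowdown begins. A quick numeric check, assuming the standard logistic function 1 / (1 + e^-t) as the model of the S-curve (no particular form is specified above):

```python
import math


def logistic(t):
    """Standard logistic S-curve, with its inflection point at t = 0."""
    return 1.0 / (1.0 + math.exp(-t))


def growth_rate(t):
    """Derivative of the logistic curve: s(t) * (1 - s(t))."""
    s = logistic(t)
    return s * (1.0 - s)


# Scan a grid of time points and find where growth is fastest.
ts = [i / 10 for i in range(-60, 61)]
steepest = max(ts, key=growth_rate)
print(steepest, growth_rate(steepest))
```

Just before and just after that point the curve looks almost identical, which is why the turn is so hard to call from the inside.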

brookst•2h ago

…but that was, like, two years ago? If we go from GPT2 to AGI in ten years that will still feel insanely fast.

alansammarone•3h ago

I have a somewhat similar point of view to the one voiced by other people, but I like to think about it slightly differently, so I'll chime in - here's my take (although, admittedly, I'm operating with a quite small reasoning budget (5 minutes tops)):

Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special and we were wrong. The Sun does not revolve around the Earth, which is a pretty typical planet, with the same chemical composition as any other planet. All of a sudden we're not the only ones who could calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.

And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.

At this point, it seems reasonable to me to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what secret sauce we have (even if we can't quite say what it is without handwaving or using concepts that by definition cannot be defined - "qualia is the indescribable feeling of red - its redness" (?)).

Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?

ivape•2h ago

I can give you a pretty wild explanation. Einstein was a freak of nature. Nature just gave him that "something" to figure out the laws of the universe. I'm avoiding the term God so as not to tickle anyone incorrectly. Seriously, explain what schooling and environment gets you that guy. So, to varying degrees, all output is from the universe. It's hard for the ego to accept; surely we earned everything we ever produced ...

Spooky stuff.

curious_cat_163•3h ago

> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically

Very clever, I must say. Kudos to folks who made this particular choice.

> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.

This is fascinating! We need more "mapping" of regimes like this!

What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.

For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.

8bitsrule•2h ago

Fusion has been 25 years away for all of my life.

sneak•1h ago

Fusion is net positive energy now; that happened in 2022 (+54%).

In 2025 they got a 313% gain (4.13 output factor).

Fusion is actually here and working. It’s not cost effective yet but to pretend there has been no progress or achievements is fundamentally false.
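The two ways of quoting these figures are the same quantity in different units: a gain factor Q is energy out divided by energy in, and the net gain is Q - 1 expressed as a percentage. A small sketch of the conversion, using only the numbers quoted above (the function names are made up for illustration):

```python
def net_gain_percent(q):
    """Convert a gain factor Q (energy out / energy in) to net gain in %."""
    return (q - 1.0) * 100.0


def gain_factor(net_percent):
    """Convert a net gain in % back to a gain factor Q."""
    return 1.0 + net_percent / 100.0


# Figures quoted above: +54% net gain, and an output factor of 4.13.
print(round(gain_factor(54.0), 2))
print(round(net_gain_percent(4.13)))
```

Which energy counts as "in" matters a great deal: the same output divided by a larger input gives a smaller Q.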

oneshtein•1h ago

It will be cost effective in just 25 years.

sitkack•1h ago

Negative Negs spit out low effort snark, they said the same thing about solar, electric cars, even multicore, jit, open source. Thanks for refuting them, the forum software itself should either quarantine the response or auto respond before the comment is submitted. These people don't build the future.

Fusion News, May 28th, 2025 https://www.youtube.com/watch?v=1YHcI-SfKx8

lrhegeba•1h ago

It isn't when you look at Q total: the total energy input for all needed support systems versus the energy produced. See https://en.wikipedia.org/wiki/Fusion_energy_gain_factor for more details.

benlivengood•2h ago

These are the kind of studies that make so much more sense than the "LLMs can't reason because of this ideological argument or this one anecdote" posts/articles. Keep 'em coming!

And also; the frontier LLMs blow older LLMs out of the water. There is continual progress and this study would have been structured substantially the same 2 years ago with much smaller N on the graphs because the regimes were much tinier then.

antics•1h ago

I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.

I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.

But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.

sitkack•1h ago

There is no reason that omniscient-yet-dim-witted has to plateau at human intelligence.

antics•1h ago

I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.

drodgers•1h ago

I think that commenter was disagreeing with this line:

> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"

It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.

drodgers•1h ago

> I think AI maximalists will continue to think that the models are in fact getting less dim-witted

I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.

What scares me is that I think there's a reasoning/agency capabilities overhang, i.e. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only by dint of applying parallelism to actually competent outcome-modelling and strategic decision making).

That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans (in terms of speed and quality) in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.

This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.

Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:

1. Reasoning/strategising step-by-step for very long periods

2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)

Getting good at the long-range thinking might require more substantial architectural changes (e.g. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.

Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.

sitkack•54m ago

I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.

esafak•1h ago

I don't know that I would call it an "illusion of thinking", but LLMs do have limitations. Humans do too. No amount of human thinking has solved numerous open problems.

th0ma5•1h ago

The errors that LLMs make and the errors that people make are probably not comparable enough in a lot of the discussions about LLM limitations at this point?

esafak•1h ago

We have different failure modes. And I'm sure researchers, faced with these results, will be motivated to overcome these limitations. This is all good, keep it coming. I just don't understand some of the naysaying here.
