Here's a quote from the article:
> How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand. (Footnote: I would like to sit down all the people who are smugly tweeting about this with a pen and paper and get them to produce every solution step for ten-disk Tower of Hanoi.)
In case someone imagines that fancy recursive reasoning is necessary to solve the Towers of Hanoi, here's the algorithm to move 10 (or any even number of) disks from peg A to peg C:
1. Move one disk from peg A to peg B or vice versa, whichever move is legal.
2. Move one disk from peg A to peg C or vice versa, whichever move is legal.
3. Move one disk from peg B to peg C or vice versa, whichever move is legal.
4. Goto 1.
Second-graders can follow that, if motivated enough.
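In case it helps, here's that loop as a minimal Python sketch (the function name, peg labels, and assertion are mine, not from the article):

```python
def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Iterative Tower of Hanoi for an even number of disks: cycle through
    the peg pairs (A, B), (A, C), (B, C) and make whichever move is legal."""
    assert n % 2 == 0, "this pair ordering assumes an even number of disks"
    pegs = {src: list(range(n, 0, -1)), aux: [], dst: []}
    pairs = [(src, aux), (src, dst), (aux, dst)]
    moves, i = [], 0
    while len(pegs[dst]) < n:
        a, b = pairs[i % 3]
        # The only legal move is the smaller top disk onto the other peg.
        if pegs[a] and (not pegs[b] or pegs[a][-1] < pegs[b][-1]):
            pegs[b].append(pegs[a].pop())
            moves.append((a, b))
        else:
            pegs[a].append(pegs[b].pop())
            moves.append((b, a))
        i += 1
    return moves

print(len(solve_hanoi(10)))  # 1023 moves, i.e. 2**10 - 1
```

No recursion, no lookahead; just the three-step loop from the list above.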
There's now constant, nonstop, obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were), are at the level of junior devs (LOL), and actually already have "PhD level" reasoning capabilities.
I don't know who is supposed to be fooled -- we have access to these things, we can try them. One can easily knock out any latest version of GPT-PhD-level-model-of-the-week with a trivial question. Nothing fundamentally changed about that since GPT-2.
The hype and the observable reality are now so far apart that one really has to wonder: Are people this gullible? Or do so many people in tech benefit from the hype train that they don't want to rain on the parade?
Huh? Schoolteachers and university professors complaining about being unable to distinguish ChatGPT-written essay answers from student-written essay answers is literally ChatGPT passing the Turing test in real time.
ChatGPT 4.5 was judged to be the human 73% of the time in this RCT study, where human interrogators had 5-minute conversations with a human and an LLM: https://arxiv.org/pdf/2503.23674
This is obviously not quite what people understand by the Turing test anymore, and I think that confusion of interpretation actually ends up weakening the linked paper. Your thought aptly describes a problem with the paper, but that problem is not present in the Turing test in its original formulation.
I think trying to discuss the minutiae of the rules is a path that leads only to madness. The Turing test was always meant to be a philosophical game. The point was to establish a scenario in which a computer could be indistinguishable from a human. Carrying it out in reality is meaningless, unless you're willing to abandon all intuitive morality.
Quite frankly, I find the paper you linked misguided. If it was undertaken by some college students, then it's good practice, but if it was carried out by seasoned professionals they should find something better to do.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
It would be surprising if you didn't quickly learn to win.
If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
The Transformer architecture, its algorithms, and the underlying matrix multiplications are a bit more involved. It would be hard to keep those inside your chain-of-thought / working memory and still understand what is going on here.
Or I could just read it. With my human eyes. It's like a single page.
It is actually worse than that analogy: Towers of Hanoi is a bimodal puzzle, in which players who grasp the general solution do inordinately better than those who do not, and the machines here are performing like the latter.
Lest anyone think otherwise, this is not a case of setting up the machines to fail, any more than the chess analogy would be. The choice of Towers of Hanoi leaves it conceivable that they would do well on tough problems, but that is not very plausible and needs to be demonstrated before it can be assumed.
Try to motivate them sufficiently to do so without error for a large number of disks, I dare you.
Now repeat this experiment while randomly refusing to accept the answer they're most confident in for any given iteration, and pick an answer they're less confident in on their behalf, and insist they still solve it without error.
(To make it equivalent to the researchers running this with temperature set to 1)
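For context, here's a minimal sketch of what sampling at temperature 1 amounts to (the three candidate "answers" and their probabilities are toy numbers, purely for illustration):

```python
import math
import random

def sample_index(probs, temperature=1.0):
    """Sample an index from softmax(log(probs) / temperature).

    At temperature 1 this draws from the full predicted distribution, so the
    most-confident option is regularly passed over; greedy decoding
    (temperature -> 0) would always take the most probable one."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return random.choices(range(len(probs)), weights=weights, k=1)[0]

# Toy example: three candidate next steps with 70% / 20% / 10% confidence.
picks = [sample_index([0.7, 0.2, 0.1]) for _ in range(1000)]
print(sum(p != 0 for p in picks) / len(picks))  # roughly 0.3: the top choice is often refused
```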
Please do correct me if the misunderstanding is mine.
"Illusions, Michael! Thinking is something a whore does for money!"
...slow pan to shocked group of staring children...
"..or cocaine!"
>The first issue I have with the paper is that Tower of Hanoi is a worse test case for reasoning than math and coding. If you’re worried that math and coding benchmarks suffer from contamination, why would you pick well-known puzzles for which we know the solutions exist in the training data?
Isn't that exactly what is wrong? It is in the training data and it can't complete it.
It simply isn't reasoning; it is second-guessing a lot of things as though it were reasoning.
Although there isn't a vast corpus on Method Ringing, there is a fair amount; the "rules" are online (https://framework.cccbr.org.uk/version2/index.html). Change ringing is based on pure maths (group theory) and has been linked with CS from when CS first started: it's mentioned in Knuth, and the Steinhaus–Johnson–Trotter algorithm for generating permutations wasn't invented by them in the 1960s, it was known to change ringers in the 1650s. Think of it as Towers of Hanoi with knobs on :-) So it would seem a good fit for automated reasoning, and indeed such things already exist: https://ropley.com/?page_id=25777.
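For a flavour of how mechanical the simplest case is, here's a minimal sketch of Plain Hunt (the most basic change-ringing pattern), built from the same kind of adjacent transpositions that Steinhaus–Johnson–Trotter uses; the function name and output format are mine:

```python
def plain_hunt(n):
    """Generate the rows of Plain Hunt on n bells: alternately swap every
    adjacent pair starting at position 0, then at position 1, until the
    bells return to rounds (1, 2, ..., n)."""
    row = list(range(1, n + 1))
    rows = [row[:]]
    start = 0
    while True:
        for i in range(start, n - 1, 2):
            row[i], row[i + 1] = row[i + 1], row[i]
        rows.append(row[:])
        start = 1 - start
        if row == list(range(1, n + 1)):
            return rows

for r in plain_hunt(4):
    print("".join(map(str, r)))
# 1234, 2143, 2413, 4231, 4321, 3412, 3142, 1324, 1234
```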
If I asked a non-ringing human to explain to me how to ring Cambridge Major, they'd say "Sorry, I don't know" and an LLM with insufficient training data would probably say the same. The problem is when LLMs know just enough to be dangerous, but they don't know what they don't know. The more abstruse a topic is, the worse LLMs are going to do at it, and it's precisely those areas where people are most likely to turn to them for answers. They'll get one that's grammatically correct and sounds authoritative - but they almost certainly won't know if it's nonsense.
Adding a "reliability" score to LLM output seems eminently feasible, but given the hype and commercial pressures around the current generation of LLMs, that's never going to happen; the pressure is on to produce plausible-sounding output, even if it's bullshit.
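As a purely illustrative sketch (not any vendor's API): one cheap version of such a score could be built from the log-probabilities a model assigns to its own output tokens. The numbers below are invented:

```python
import math

def reliability_score(token_logprobs):
    """A naive 'reliability' proxy: the geometric-mean probability the model
    assigned to its own output tokens (exp of the mean log-probability).
    Low values mean the model was often unsure which token to emit; it is
    not a truth score, since a model can be confidently wrong."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two completions.
confident_answer = [-0.05, -0.10, -0.02, -0.08]
shaky_answer = [-0.9, -2.3, -1.7, -3.1]
print(reliability_score(confident_answer))  # ~0.94
print(reliability_score(shaky_answer))      # ~0.14
```

Even something this crude would at least flag when the model was guessing token by token.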
https://www.lawgazette.co.uk/news/appalling-high-court-judge...
davedx•4h ago
I found this comment to be relevant: "Keep in mind this whitepaper is really just Apple circling the wagons because they have dick for proprietary AI tech."
When you question the source, it really does raise eyebrows, especially for me as an Apple shareholder: these Apple employees are busy not working on Apple's own AI programme, which is now insanely far behind those of other big tech companies, but are instead spending their time casting shade on the reasoning models developed at other AI labs.
What's the motivation here, really? The paper itself isn't particularly insightful or ground-breaking.
smitty1e•4h ago
Now, if we fed the relevant references into an AI model, would the model offer this as a possible motive for the paper in question?
reliabilityguy•3h ago
How do you know it was peer-reviewed? What venue had accepted this paper for publication?
tough•3h ago
This was certainly a first for me when I saw it pop up on HN the other day.
reliabilityguy•3h ago
Doesn’t mean they are peer-reviewed.
tikhonj•4h ago
People's time and attention are not fungible—especially in inherently creative pursuits like research—and the mindset in your comment is exactly the sort of superficial administrative reasoning that leads to hype bubbles unconstrained by reality.
"Why are you wasting your time trying to understand what we're doing instead of rushing ahead without thinking" is absolutely something I've heard from managers and executives, albeit phrased more politically, and it never ends well in a holistic accounting.