Here's a quote from the article:
> How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand. (Footnote: I would like to sit down all the people who are smugly tweeting about this with a pen and paper and get them to produce every solution step for ten-disk Tower of Hanoi.)
In case someone imagines that fancy recursive reasoning is necessary to solve the Towers of Hanoi, here's the algorithm to move 10 (or any even number of) disks from peg A to peg C:
1. Move one disk from peg A to peg B or vice versa, whichever move is legal.
2. Move one disk from peg A to peg C or vice versa, whichever move is legal.
3. Move one disk from peg B to peg C or vice versa, whichever move is legal.
4. Goto 1.
Second-graders can follow that, if motivated enough.
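In case it helps, here's that loop as a minimal Python sketch (the function name, peg labels, and assertion are mine, not from the article):

```python
def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Iterative Tower of Hanoi for an even number of disks: cycle through
    the peg pairs (A, B), (A, C), (B, C) and make whichever move is legal."""
    assert n % 2 == 0, "this pair ordering assumes an even number of disks"
    pegs = {src: list(range(n, 0, -1)), aux: [], dst: []}
    pairs = [(src, aux), (src, dst), (aux, dst)]
    moves, i = [], 0
    while len(pegs[dst]) < n:
        a, b = pairs[i % 3]
        # The only legal move is the smaller top disk onto the other peg.
        if pegs[a] and (not pegs[b] or pegs[a][-1] < pegs[b][-1]):
            pegs[b].append(pegs[a].pop())
            moves.append((a, b))
        else:
            pegs[a].append(pegs[b].pop())
            moves.append((b, a))
        i += 1
    return moves

print(len(solve_hanoi(10)))  # 1023 moves, i.e. 2**10 - 1
```

No recursion, no lookahead; just the three-step loop from the list above.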
There's now constant, nonstop, obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were), are at the level of junior devs (LOL), and actually already have "PhD level" reasoning capabilities.
I don't know who is supposed to be fooled -- we have access to these things, we can try them. One can easily knock out any latest version of GPT-PhD-level-model-of-the-week with a trivial question. Nothing fundamentally changed about that since GPT-2.
The hype and the observable reality are now so far apart that one really has to wonder: Are people this gullible? Or do so many people in tech benefit from the hype train that they don't want to rain on the parade?
Huh? Schoolteachers and university professors complaining about being unable to distinguish ChatGPT-written essay answers from student-written essay answers is literally ChatGPT passing the Turing test in real time.
ChatGPT 4.5 was judged to be the human 73% of the time in this RCT study, where human interrogators had 5-minute conversations with a human and an LLM: https://arxiv.org/pdf/2503.23674
This is obviously not quite what people understand by the Turing test anymore, and I think that confusion of interpretation actually ends up weakening the linked paper. Your thought aptly describes a problem with the paper, but that problem is not present in the Turing test in its original formulation.
I think trying to discuss the minutiae of the rules is a path that leads only to madness. The Turing test was always meant to be a philosophical game. The point was to establish a scenario in which a computer could be indistinguishable from a human. Carrying it out in reality is meaningless, unless you're willing to abandon all intuitive morality.
Quite frankly, I find the paper you linked misguided. If it was undertaken by some college students, then it's good practice, but if it was carried out by seasoned professionals they should find something better to do.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
It would be surprising if you didn't quickly learn to win.
If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
The Transformer architecture, its algorithms, and the underlying matrix multiplications are a bit more involved. It would be hard to keep those inside your chain-of-thought / working memory and still understand what is going on here.
Or I could just read it. With my human eyes. It's like a single page.
It is actually worse than that analogy: Towers of Hanoi is a bimodal puzzle, in which players who grasp the general solution do inordinately better than those who do not, and the machines here are performing like the latter.
Lest anyone think otherwise, this is not a case of setting up the machines to fail, any more than the chess analogy would be. The choice of Towers of Hanoi leaves it conceivable that they would do well on tough problems, but that is not very plausible and needs to be demonstrated before it can be assumed.
Try to motivate them sufficiently to do so without error for a large number of disks, I dare you.
Now repeat this experiment while randomly refusing to accept the answer they're most confident in for any given iteration, and pick an answer they're less confident in on their behalf, and insist they still solve it without error.
(To make it equivalent to the researchers running this with temperature set to 1)
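For context, here's a minimal sketch of what sampling at temperature 1 amounts to (the three candidate "answers" and their probabilities are toy numbers, purely for illustration):

```python
import math
import random

def sample_index(probs, temperature=1.0):
    """Sample an index from softmax(log(probs) / temperature).

    At temperature 1 this draws from the full predicted distribution, so the
    most-confident option is regularly passed over; greedy decoding
    (temperature -> 0) would always take the most probable one."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return random.choices(range(len(probs)), weights=weights, k=1)[0]

# Toy example: three candidate next steps with 70% / 20% / 10% confidence.
picks = [sample_index([0.7, 0.2, 0.1]) for _ in range(1000)]
print(sum(p != 0 for p in picks) / len(picks))  # roughly 0.3: the top choice is often refused
```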
Please do correct me if the misunderstanding is mine.
"Illusions, Michael! Thinking is something a whore does for money!"
...slow pan to shocked group of staring children...
"..or cocaine!"
>The first issue I have with the paper is that Tower of Hanoi is a worse test case for reasoning than math and coding. If you’re worried that math and coding benchmarks suffer from contamination, why would you pick well-known puzzles for which we know the solutions exist in the training data?
Isn't that exactly what is wrong? It is in the training data and it can't complete it.
It simply isn't reasoning; it is second-guessing a lot of things as though it were reasoning.
Although there isn't a vast corpus on Method Ringing, there is a fair amount; the "rules" are online (https://framework.cccbr.org.uk/version2/index.html). Change ringing is based on pure maths (group theory) and has been linked with CS from when CS first started: it's mentioned in Knuth, and the Steinhaus–Johnson–Trotter algorithm for generating permutations wasn't invented by them in the 1960s, it was known to change ringers in the 1650s. Think of it as Towers of Hanoi with knobs on :-) So it would seem a good fit for automated reasoning, and indeed such things already exist: https://ropley.com/?page_id=25777.
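For a flavour of how mechanical the simplest case is, here's a minimal sketch of Plain Hunt (the most basic change-ringing pattern), built from the same kind of adjacent transpositions that Steinhaus–Johnson–Trotter uses; the function name and output format are mine:

```python
def plain_hunt(n):
    """Generate the rows of Plain Hunt on n bells: alternately swap every
    adjacent pair starting at position 0, then at position 1, until the
    bells return to rounds (1, 2, ..., n)."""
    row = list(range(1, n + 1))
    rows = [row[:]]
    start = 0
    while True:
        for i in range(start, n - 1, 2):
            row[i], row[i + 1] = row[i + 1], row[i]
        rows.append(row[:])
        start = 1 - start
        if row == list(range(1, n + 1)):
            return rows

for r in plain_hunt(4):
    print("".join(map(str, r)))
# 1234, 2143, 2413, 4231, 4321, 3412, 3142, 1324, 1234
```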
If I asked a non-ringing human to explain to me how to ring Cambridge Major, they'd say "Sorry, I don't know" and an LLM with insufficient training data would probably say the same. The problem is when LLMs know just enough to be dangerous, but they don't know what they don't know. The more abstruse a topic is, the worse LLMs are going to do at it, and it's precisely those areas where people are most likely to turn to them for answers. They'll get one that's grammatically correct and sounds authoritative - but they almost certainly won't know if it's nonsense.
Adding a "reliability" score to LLM output seems eminently feasible, but given the hype and commercial pressures around the current generation of LLMs, that's never going to happen; the pressure is on to produce plausible-sounding output, even if it's bullshit.
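As a purely illustrative sketch (not any vendor's API): one cheap version of such a score could be built from the log-probabilities a model assigns to its own output tokens. The numbers below are invented:

```python
import math

def reliability_score(token_logprobs):
    """A naive 'reliability' proxy: the geometric-mean probability the model
    assigned to its own output tokens (exp of the mean log-probability).
    Low values mean the model was often unsure which token to emit; it is
    not a truth score, since a model can be confidently wrong."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two completions.
confident_answer = [-0.05, -0.10, -0.02, -0.08]
shaky_answer = [-0.9, -2.3, -1.7, -3.1]
print(reliability_score(confident_answer))  # ~0.94
print(reliability_score(shaky_answer))      # ~0.14
```

Even something this crude would at least flag when the model was guessing token by token.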
https://www.lawgazette.co.uk/news/appalling-high-court-judge...
davedx•4h ago
I found this comment to be relevant: "Keep in mind this whitepaper is really just Apple circling the wagons because they have dick for proprietary AI tech."
When you question the source, it really does raise eyebrows, especially for me as an Apple shareholder: these Apple employees are busy not working on Apple's own AI programme, which is now insanely far behind those of other big tech companies, but are instead spending their time casting shade on the reasoning models developed at other AI labs.
What's the motivation here, really? The paper itself isn't particularly insightful or ground-breaking.
smitty1e•4h ago
Now, if we fed the relevant references into an AI model, would the model offer this as a possible motive for the paper in question?
reliabilityguy•3h ago
How do you know it was peer-reviewed? What venue had accepted this paper for publication?
tough•3h ago
This was certainly a first for me when I saw it pop up on HN the other day.
reliabilityguy•3h ago
Doesn’t mean they are peer-reviewed.
tikhonj•4h ago
People's time and attention are not fungible—especially in inherently creative pursuits like research—and the mindset in your comment is exactly the sort of superficial administrative reasoning that leads to hype bubbles unconstrained by reality.
"Why are you wasting your time trying to understand what we're doing instead of rushing ahead without thinking" is absolutely something I've heard from managers and executives, albeit phrased more politically, and it never ends well in a holistic accounting.