Weird.
And it has been discussed to death already:
Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems) [https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...]
Seven replies to the viral Apple reasoning paper and why they fall short [https://news.ycombinator.com/item?id=44278403]
It is absolutely obvious that algorithmic problems like the Tower of Hanoi can't benefit from sampling. Algorithmic puzzles are also a convenient choice for the paper's authors because they are easy to verify, but they are very far from what we want the models to do and from what they are good at. A model would solve this kind of problem by implementing the algorithm in Python and calling a tool to execute it; that is how it can solve such problems far more easily.
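For illustration, something like the following is roughly what a model would hand to a code-execution tool rather than spelling out every move in its chain of thought (a minimal sketch; the function name and the final print are my own choices, not anything from the paper):

    # Classic recursive Tower of Hanoi: move n-1 disks aside, move the
    # largest disk, then move the n-1 disks on top of it.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)
            moves.append((src, dst))
            hanoi(n - 1, aux, src, dst, moves)
        return moves

    print(len(hanoi(10)))  # 1023 moves, i.e. 2**10 - 1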
Moreover, on most benchmarks CoT improves LLM performance a lot, because sampling helps immensely in producing a better reply. So this paper's negative result runs against a very broad body of experience with CoT being a powerful tool for LLMs, simply because most benchmarks operate in domains where sampling is very useful.
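As a rough illustration of the kind of sampling that helps on those benchmarks, think of a self-consistency style loop: draw several chain-of-thought answers and keep the most common one. This is a toy sketch of mine, and `generate` is a hypothetical stand-in for whatever LLM call you use:

    from collections import Counter

    # Sample n chain-of-thought answers and majority-vote the final reply.
    # This helps when independent samples tend to land on the right answer,
    # but not when every sample must execute the same long exact procedure.
    def best_of_n(generate, prompt, n=8):
        answers = [generate(prompt) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]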
In short, the Apple paper mostly says things that were already obvious: it is as if the authors set out to reach a negative result. It was already the widespread view that CoT cannot perform algorithmic work by concatenating tokens, except in the most trivial cases. Yet it helps a lot when the task is to combine knowledge and ideas already inside the model to produce a better reply.
Apple's point is that if we want to build something smarter than us, we need to look at intelligence and reasoning from a different angle.
E.g.
https://news.ycombinator.com/item?id=44203562
https://news.ycombinator.com/item?id=44221900
https://news.ycombinator.com/item?id=44234626
I think if you added a step where the LLMs tweak their own build process and redeploy, your experiment would have wildly different results.
Basically, so-called "reasoning" is just the generation of additional intermediate output that resembles real reasoning but isn't.
https://transformer-circuits.pub/2025/attribution-graphs/bio...
leotsem•6h ago
Instead of relying on standard benchmarks, the authors designed controlled environments—like Tower of Hanoi and River Crossing puzzles—to test how models handle increasing compositional complexity. The results: performance doesn’t taper off, it collapses. And even when the models fail, they continue to produce fluent, structured reasoning traces that sound convincing but fall apart logically.
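To get a feel for how quickly that compositional burden grows, the minimum Tower of Hanoi solution is 2^n - 1 moves, so the trace a model must emit roughly doubles with every added disk (a toy calculation of mine, not a figure from the paper):

    # Minimum number of moves for the n-disk Tower of Hanoi.
    for n in range(1, 13):
        print(n, 2**n - 1)  # e.g. 7 -> 127, 12 -> 4095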
If you’re building on top of LLMs or reasoning-augmented models, it’s well worth a look.
salviati•6h ago
I heard about that paper through an "AI explained" video [0], so I might be biased, but I agree with that video that the Apple paper is "meh" at best: it points out LLM limitations that are hardly a surprise.
[0] https://www.youtube.com/watch?v=wPBD6wTap7g
ForHackernews•6h ago
To me, that paper was reassuring that I wasn't taking crazy pills. I've worked with these tools to produce code, and they routinely make mistakes that no thinking entity (yes, I've worked with some dimwitted junior devs) ever would. Yes, they are powerful and useful tools, but they're not "thinking" in any meaningful sense (defined here as rigorously determining an algorithm and applying it correctly).