Skimming through the conclusions and results, the authors find that LLMs exhibit failures across many of the axes we'd consider demonstrative of AGI: moral reasoning, simple things a toddler can do like counting, and so on. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.
So. If you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.
Note there's a GitHub repo from the authors compiling these failures: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...