To a casual observer, this seems like a big deal. Can knowledgeable folks comment on this work?
AIPedant•6mo ago
I am still reading the paper, but it is worth noting that this is not an LLM! It is closer to something like AlphaGo, trained only on ARC, Sudoku, and mazes. I am skeptical that you could add a bunch of science facts and programming examples without degrading the performance on ARC etc. Frankly, it’s completely unclear to me how you would make this architecture into a chatbot, period, but I haven’t thought about it very much.
Comparing the maze/Sudoku results to LLMs rather than maze/Sudoku-specific AIs strikes me as blatantly dishonest. “1k Sudoku training examples” is also dishonest: they generate about a million of them with permutations: https://news.ycombinator.com/item?id=44701264 (see also https://github.com/sapientinc/HRM/blob/main/dataset/build_su...). And they seem to have deleted the Sudoku training data! Or maybe they made it private. It used to be here: https://github.com/imone, and according to the Git history[1] they moved it here: https://github.com/sapientinc, but I cannot find it. Might be an innocent mistake; I suspect they got called out for lying about “1000 samples” and are covering their tracks.
[1] https://github.com/sapientinc/HRM/commit/171e2fcde636bcb7e6c...
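For anyone wondering what the augmentation described above looks like, here is a minimal, purely illustrative sketch of validity-preserving Sudoku permutations (digit relabeling, band/row and stack/column shuffles, optional transpose). This is not the repo's actual build script — that sits behind the truncated build_su... link — just a picture of how a small pool of seed puzzles can be multiplied into roughly a million training samples.

    # Illustrative sketch only: every transform below maps a valid Sudoku
    # grid to another valid Sudoku grid, so one seed puzzle yields many
    # distinct-looking training examples.
    import random
    import numpy as np

    def augment(puzzle: np.ndarray, solution: np.ndarray, rng: random.Random):
        """Return one randomly permuted (puzzle, solution) pair.

        Both arrays are 9x9 ints; 0 marks a blank cell in the puzzle.
        """
        p, s = puzzle.copy(), solution.copy()

        # 1. Relabel digits 1..9 with a random permutation (blanks stay 0).
        relabel = np.array([0] + rng.sample(range(1, 10), 9))
        p, s = relabel[p], relabel[s]

        # 2. Shuffle the three row bands, and the rows within each band.
        row_order = [r for band in rng.sample(range(3), 3)
                     for r in rng.sample(range(band * 3, band * 3 + 3), 3)]
        p, s = p[row_order, :], s[row_order, :]

        # 3. Same for the column stacks and the columns within each stack.
        col_order = [c for stack in rng.sample(range(3), 3)
                     for c in rng.sample(range(stack * 3, stack * 3 + 3), 3)]
        p, s = p[:, col_order], s[:, col_order]

        # 4. Optionally transpose the whole grid.
        if rng.random() < 0.5:
            p, s = p.T, s.T

        return p, s

    # e.g. ~1,000 seed puzzles x ~1,000 augmentations each ≈ 1M samples.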
algo_trader•6mo ago
Ah! This explains the performance.
What is the conventional wisdom on improving codegen in LLMs? Sample n solutions and verify, or run a more expensive tree search?
I have thoughts on a very elaborate add-a-function-verify-and-rollback testing harness, and I wonder if this has been tried.
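On the sample-and-verify side, the basic loop is simple enough to sketch. Below, generate_candidates is a hypothetical stand-in for whatever model API you use; the part being illustrated is the generate, run tests, keep-or-discard loop. The add-a-function-verify-and-rollback harness mentioned above is essentially this loop applied per function, reverting on failure, while a tree search spends more compute expanding and re-verifying partial programs instead of whole ones.

    # Hedged sketch of "sample n solutions and verify" for codegen.
    import subprocess
    import tempfile

    def generate_candidates(prompt: str, n: int) -> list[str]:
        """Placeholder: return n candidate Python sources from an LLM."""
        raise NotImplementedError("call your model of choice here")

    def passes_tests(candidate_src: str, test_src: str, timeout: float = 10.0) -> bool:
        """Run candidate plus tests in a fresh interpreter; pass == exit code 0."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_src + "\n\n" + test_src)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    def best_of_n(prompt: str, test_src: str, n: int = 8) -> str | None:
        """Sample n candidates and return the first one that passes the tests."""
        for candidate in generate_candidates(prompt, n):
            if passes_tests(candidate, test_src):
                return candidate
        return None  # all failed; caller can resample or fall back to search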