Sapients paper on the concept of Hierarchical Reasoning Model

https://arxiv.org/abs/2506.21734

63•hansmayer•4h ago

Comments

torginus•2h ago

Is it just me or are symbolic (or as I like to call it 'video game') AI is seeping back into AI?

taylorius•2h ago

Perhaps so - but represented in a trainable, neural form. Very exciting!

bobosha•46m ago

But symbolic != hierarchical

cs702•2h ago

Based on a quick first skim of the abstract and the introduction, the results from hierarchical reasoning (HRM) models look incredible:

> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.

I'm going to read this carefully, in its entirety.

Thank you for sharing it on HN!

diwank•2h ago

Exactly!

> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).

> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI-2, surpassing larger models like Claude 3.7 and DeepSeek R1.

Erm what? How? Needs a computer and sitting down.

mkagenius•36m ago

Is it talking about fine tuning existing models with 1000 examples to beat them in those tasks?

electroglyph•2h ago

but does it scale?

lispitillo•2h ago

I hope/fear this HRM model is going to be merged with MoE very soon. Given the huge economic pressure to develop powerful LLMs I think this can be done in just a month.

The paper seems to only study problems like sudoku solving, and not question answering or other applications of LLMs. Furthermore they omit a section for future applications or fusion with current LLMs.

I think anyone working in this field can envision their applications, but the details to have a MoE with an HRM model could be their next paper.

I only skimmed the paper and I am not an expert, sure other will/can explain why they don't discuss such a new structure. Anyway, my post is just blissful ignorance over the complexity involved and the impossible task to predict change.

Edit: A more general idea is that Mixture of Expert is related to cluster of concepts and now we would have to consider a cluster of concepts related by the time they take to be grasped, so in a sense the model would have in latent space an estimation of the depth, number of layers, and time required for each concept, just like we adapt our reading style for a dense math book different to a newspaper short story.

buster•1h ago

must say I am suspicious in this regard, as they don't show applications other than a Sudoku solver and don't discuss downsides.

Oras•1h ago

and the training was only on Sudoku. Which means they need to train a small model for every problem that currently exists.

Back to ML models?

yorwba•1h ago

This HRM is essentially purpose-designed for solving puzzles with a small number of rules interacting in complex ways. Because the number of rules is small, a small model can learn them. Because the model is small, it can be run many times in a loop to resolve all interactions.

In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other, so I don't think you could ever get away with a similarly small model. Fortunately, a comparatively small number of steps typically seems to be enough to get decent results.

But if you tried to use an LLM-sized model in an HRM-style loop, it would be dog slow, so I don't expect anyone to try it anytime soon. Certainly not within a month.

Maybe you could have a hybrid where an LLM has a smaller HRM bolted on to solve the occasional constraint-satisfaction task.

energy123•8m ago

What about many small HRM models that solve conceptually distinct subtasks as determined and routed to by a master model who then analyzes and aggregates the outputs, with all of that learned during training.

OgsyedIE•46m ago

Skimming this, there is no reason why a MoE LLM system (whether autoregressive, diffusion, energy-based or mixed) couldn't be given a nested architecture that duplicates the layout of a HRM. Combining these in different ways should allow for some novel benchmarks around efficiency and quality, which will be interesting.

0x000xca0xfe•30m ago

Goodbye captchas I guess? Somehow they are still around.

topspin•27m ago

> "After completing the T steps, the H-module incorporates the sub-computation’s outcome (the final state L) and performs its own update. This H update establishes a fresh context for the L-module, essentially “restarting” its computational path and initiating a new convergence phase toward a different local equilibrium."

So they let the low-level RNN bottom out, evaluate the output in the high level module, and generate a new context for the low-level RNN. Rinse, repeat. The low-level RNNs are iterating backpropagation while the high-level is periodically kicking the low-level RNNs to get better outputs. Loops within loops.

Another interesting part:

> "Neuroscientific evidence shows that these cognitive modes share overlapping neural circuits, particularly within regions such as the prefrontal cortex and the default mode network. This indicates that the brain dynamically modulates the “runtime” of these circuits according to task complexity and potential rewards.

> Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables `thinking, fast and slow'"

A scheduler that dynamically balances resources based on the necessary depth of reasoning and the available data.

I love how this paper cites parallels with real brains throughout. I believe AGI will be solved as the primitives we're developing are composed to extreme complexity, utilizing many cooperating, competing, communicating, concurrent, specialized "modules." It is apparent to me that human brain must have this complexity, because it's the only feasible way evolution had to achieve cognition using slow, low power tissue.

JonathanRaines•9m ago

I advise scepticism.

This work does have some very interesting ideas, specifically avoiding the costs of backpropagation through time.

However, it does not appear to have been peer reviewed.

The results section is odd. It does not include include details of how they performed the assesments, and the only numerical values are in the figure on the front page. The results for ARC2 are (contrary to that figure) not top of the leaderboard (currently 19% compared to HRMs 5% https://www.kaggle.com/competitions/arc-prize-2025/leaderboa...)

When We Get Komooted

Jeff Bezos doesn't believe in PowerPoint, and his employees agree

Linux on Snapdragon X Elite: Linaro and Tuxedo Pave the Way for ARM64 Laptops

Chemical process produces critical battery metals with no waste

Sapients paper on the concept of Hierarchical Reasoning Model

Fast and cheap bulk storage: using LVM to cache HDDs on SSDs

Smallest particulate matter air quality sensor for ultra-compact IoT devices

A low power 1U Raspberry Pi cluster server for inexpensive colocation

Janet: Lightweight, Expressive, Modern Lisp

Cable Bacteria Are Living Batteries

Resizable structs in Zig

How we rooted Copilot

Purple Earth hypothesis

Implementing dynamic scope for Fennel and Lua

16colo.rs: ANSI/ASCII art archive

Rust running on every GPU

Low cost mmWave 60GHz radar sensor for advanced sensing

Coronary artery calcium testing can reveal plaque in arteries, but is underused

Reading QR codes without a computer

Personal aviation is about to get interesting (2023)

Constrained languages are easier to optimize

What went wrong for Yahoo

Teach Yourself Programming in Ten Years (1998)

Paul Dirac and the religion of mathematical beauty (2011) [video]

Show HN: QuickTunes: Apple Music player for Mac with iPod vibes

The natural diamond industry is getting rocked. Thank the lab-grown variety

Getting decent error reports in Bash when you're using 'set -e'

Three high-performance RISC-V processors to watch in H2 2025

Beyond Food and People

Arvo Pärt at 90

Sapients paper on the concept of Hierarchical Reasoning Model

Comments

When We Get Komooted

Jeff Bezos doesn't believe in PowerPoint, and his employees agree

Linux on Snapdragon X Elite: Linaro and Tuxedo Pave the Way for ARM64 Laptops

Chemical process produces critical battery metals with no waste

Sapients paper on the concept of Hierarchical Reasoning Model

Fast and cheap bulk storage: using LVM to cache HDDs on SSDs

Smallest particulate matter air quality sensor for ultra-compact IoT devices

A low power 1U Raspberry Pi cluster server for inexpensive colocation

Janet: Lightweight, Expressive, Modern Lisp

Cable Bacteria Are Living Batteries

Resizable structs in Zig

How we rooted Copilot

Purple Earth hypothesis

Implementing dynamic scope for Fennel and Lua

16colo.rs: ANSI/ASCII art archive

Rust running on every GPU

Low cost mmWave 60GHz radar sensor for advanced sensing

Coronary artery calcium testing can reveal plaque in arteries, but is underused

Reading QR codes without a computer

Personal aviation is about to get interesting (2023)

Constrained languages are easier to optimize

What went wrong for Yahoo

Teach Yourself Programming in Ten Years (1998)

Paul Dirac and the religion of mathematical beauty (2011) [video]

Show HN: QuickTunes: Apple Music player for Mac with iPod vibes

The natural diamond industry is getting rocked. Thank the lab-grown variety

Getting decent error reports in Bash when you're using 'set -e'

Three high-performance RISC-V processors to watch in H2 2025

Beyond Food and People

Arvo Pärt at 90