> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.
I'm going to read this carefully, in its entirety.
Thank you for sharing it on HN!
> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).
> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI-2, surpassing larger models like Claude 3.7 and DeepSeek R1.
Erm what? How? Needs a computer and sitting down.
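On the "how": here is a toy sketch of what the quoted two-module description could look like in code. It is not the paper's implementation, only an illustration of nested recurrence in which a fast low-level state updates several times for each update of a slow high-level state. Every name, the sizes, the random untrained weights, and the plain tanh update rule are assumptions made for the example.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.3):
    # Random, untrained weights; a real model would learn these.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def step(weights, inputs):
    # One recurrent update: tanh(W @ inputs), written out in plain Python.
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def two_timescale_forward(x, hidden=4, cycles=3, low_steps=5, seed=0):
    """Hypothetical helper: run `cycles` slow updates, each preceded by
    `low_steps` fast updates. Returns the final high-level state and
    the number of low-level and high-level updates performed."""
    rng = random.Random(seed)
    w_low = make_matrix(hidden, 2 * hidden + len(x), rng)  # sees z_low, z_high, x
    w_high = make_matrix(hidden, 2 * hidden, rng)          # sees z_high, z_low
    z_low = [0.0] * hidden
    z_high = [0.0] * hidden
    low_updates = 0
    high_updates = 0
    for _ in range(cycles):
        # Fast module: several detailed steps under a fixed high-level state.
        for _ in range(low_steps):
            z_low = step(w_low, z_low + z_high + x)
            low_updates += 1
        # Slow module: one update that absorbs the low-level result.
        z_high = step(w_high, z_high + z_low)
        high_updates += 1
    return z_high, low_updates, high_updates

z, n_low, n_high = two_timescale_forward([1.0, 0.0, -1.0])
print(n_low, n_high)
```

With the default settings the low-level state is updated 15 times and the high-level state 3 times, so the effective depth is cycles × low_steps steps even though only two small weight matrices are reused throughout.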
The paper seems to study only problems like Sudoku solving, not question answering or other applications of LLMs. Furthermore, they omit any section on future applications or on fusion with current LLMs.
I think anyone working in this field can envision the applications, but the details of how to build a MoE that includes an HRM model could be their next paper.
I only skimmed the paper and I am not an expert; I'm sure others can explain why they don't discuss such a new structure. Anyway, my post is just blissful ignorance of the complexity involved and of the impossible task of predicting change.
Edit: A more general idea is that Mixture of Experts is related to clusters of concepts, and now we would have to consider clusters of concepts related by the time they take to be grasped. In a sense the model would hold, in latent space, an estimate of the depth, number of layers, and time required for each concept, just as we adapt our reading style to a dense math book versus a short newspaper story.