Another article about H-Nets: https://main-horse.github.io/posts/hnet-inf/
Thank you for sharing on HN!
---
EDIT: The hierarchical composition and routing aspects of this work vaguely remind me of https://github.com/glassroom/heinsen_routing/ but it has been a while since I played with that. UPDATE: After spending a bit more time on the OP, it's different, but the ideas are related, like routing based on similarity.
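To make that connection concrete, here's a minimal sketch of similarity-based boundary scoring in the spirit of H-Net's dynamic chunking (not the authors' code; the real routing module learns projections before comparing, and the function name and shapes here are mine). The idea: adjacent positions whose representations disagree get scored as likely chunk boundaries.

```python
import torch
import torch.nn.functional as F

def boundary_scores(hidden: torch.Tensor) -> torch.Tensor:
    """Score each position as a chunk boundary from similarity of adjacent states.

    hidden: (batch, seq_len, dim) representations from a byte-level encoder.
    Returns (batch, seq_len) scores in [0, 1]; position 0 is forced to be a boundary.
    """
    prev, curr = hidden[:, :-1], hidden[:, 1:]
    cos = F.cosine_similarity(curr, prev, dim=-1)   # (batch, seq_len - 1)
    p = 0.5 * (1.0 - cos)                           # dissimilar neighbours -> high score
    first = torch.ones_like(p[:, :1])               # sequence always starts a new chunk
    return torch.cat([first, p], dim=1)
```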
Since it's end to end, this allows them to apply the process not only to raw byte encodings but to representations at basically any level, for example by stacking two stages of aggregation one after another.
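For what that stacking could look like, here's a rough skeleton (my own hypothetical naming, not the paper's code) where one hierarchy stage wraps another, so bytes get compressed into word-like chunks and those chunks into coarser ones:

```python
import torch.nn as nn

class HNetStage(nn.Module):
    """One hierarchy stage: encode -> chunk down -> inner model -> de-chunk -> decode.

    `chunker` / `dechunker` stand in for the paper's dynamic chunking modules;
    `inner` is whatever runs on the shortened sequence (e.g. another HNetStage).
    """
    def __init__(self, encoder, chunker, inner, dechunker, decoder):
        super().__init__()
        self.encoder, self.chunker = encoder, chunker
        self.inner = inner
        self.dechunker, self.decoder = dechunker, decoder

    def forward(self, x):
        h = self.encoder(x)             # fine-grained (e.g. byte-level) states
        chunks, info = self.chunker(h)  # shorter sequence of chunk vectors
        z = self.inner(chunks)          # heavy model works at the coarse level
        up = self.dechunker(z, info)    # expand back to the original length
        return self.decoder(up + h)     # residual keeps fine-grained detail

# Two stages stacked: bytes -> "word-like" chunks -> coarser chunks.
# inner_stage = HNetStage(enc2, chunk2, main_lm, dechunk2, dec2)
# outer_stage = HNetStage(enc1, chunk1, inner_stage, dechunk1, dec1)
```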
So in principle they could either let the model do its thing on the raw bytes of an image, or alternatively cut it up into tiny patches ViT-style and feed those to their H-Net.
I wonder how hard it would be to adapt the chunking to work in 2D, and what that would even look like.
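One naive starting point (purely my speculation, nothing from the paper) would be to linearize the image into a patch sequence first and leave the "real" 2D chunking as the open problem:

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Cut an image into tiny non-overlapping patches, ViT-style, and flatten
    them into a 1D sequence that a sequence model like an H-Net could consume.

    img: (batch, channels, height, width); height and width divisible by `patch`.
    Returns (batch, num_patches, channels * patch * patch).
    """
    b, c, h, w = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (b, c, h/p, w/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

# tokens = image_to_patch_sequence(torch.randn(2, 3, 32, 32))  # -> (2, 64, 48)
# A genuinely 2D-aware chunker would instead have to merge neighbouring patches
# along both axes, which is exactly the open question above.
```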
Some other notes on how multimodal inputs could be handled with this architecture are mentioned, though only briefly, in Albert Gu's (one of the authors') blog; there's still much to figure out, it would seem: https://goombalab.github.io/blog/2025/hnet-future/#alternati...
Big, if so.
In fact, Gu's blog post (linked in a post below) mentions that they created a Mamba model that used this in place of the tokenizer.
Doesn't this architecture also treat all inputs equally? It seems like an encoder that preprocesses the input by inferring hierarchy. But don't all models essentially do that while training?
I sort of disagree with the assertion that "language is fundamentally hierarchical", in that it supposes there is a single abstraction hierarchy that is universally preferable or correct. That's just not true. Choosing just one useful hierarchy doesn't hurt anybody and is definitely simpler, but why learn only one? Why not learn multiple, and also learn how to modulate between them?
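As a toy illustration of that "learn multiple and modulate" idea (my own hypothetical sketch, nothing from the H-Net paper), you could run several candidate hierarchies in parallel and learn a per-position gate over them:

```python
import torch
import torch.nn as nn

class MultiHierarchyMix(nn.Module):
    """Run several candidate hierarchies in parallel and softly mix their outputs
    with a learned, per-position gate."""
    def __init__(self, dim: int, hierarchies: nn.ModuleList):
        super().__init__()
        self.hierarchies = hierarchies                 # each maps (b, t, d) -> (b, t, d)
        self.gate = nn.Linear(dim, len(hierarchies))   # per-position mixing weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        outs = torch.stack([m(h) for m in self.hierarchies], dim=-1)  # (b, t, d, k)
        w = torch.softmax(self.gate(h), dim=-1).unsqueeze(2)          # (b, t, 1, k)
        return (outs * w).sum(dim=-1)                                 # (b, t, d)
```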
> 1. H-Nets scale better with data than state-of-the-art Transformers with BPE tokenization, while learning directly from raw bytes. This improved scaling is even more pronounced on domains without natural tokenization boundaries, like Chinese, code, and DNA.
> 2. H-Nets can be stacked together to learn from deeper hierarchies, which further improves performance.
> 3. H-Nets are significantly more robust to small perturbations in input data like casing, showing an avenue for creating models that are more robust and aligned with human reasoning.