Hi HN, author here.
For years, it bothered me that convolution (the king of vision) and matrix multiplication / self-attention (the engine of Transformers) were treated as completely separate, specialized tools. It felt like we were missing a more fundamental principle.
This paper is my attempt to find that principle. I introduce a framework called GWO (Generalized Windowed Operation) that describes any neural operation using just three simple, orthogonal components:
Path: Where to look
Shape: What form to look for
Weight: What to value
Using this "grammar", you can express both a standard convolution and self-attention, and see them as just different points in the same design space.
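To make that concrete, here's a minimal NumPy sketch of the idea — it is not the paper's formal definitions, the names (gwo_op, conv_path, attn_weight, etc.) are just my illustrative choices, and the "convolution" and "attention" are stripped-down 1-D toys. The point is only that both fall out of the same gather-then-weight skeleton once you swap the Path/Shape and Weight pieces:

    import numpy as np

    def gwo_op(x, path_fn, weight_fn):
        # Toy 1-D "generalized windowed operation": for each position i,
        # Path/Shape pick which indices to look at, Weight says how to value them.
        n, d = x.shape
        out = np.zeros_like(x)
        for i in range(n):
            idx = path_fn(i, n)            # Path + Shape: where / what form to look at
            window = x[idx]                # gathered neighbourhood
            w = weight_fn(x[i], window)    # Weight: static kernel or content-dependent scores
            out[i] = w @ window            # weighted aggregation
        return out

    # Convolution-like instance: fixed local window, static learned weights.
    kernel = np.random.randn(3)
    conv_path   = lambda i, n: np.clip([i - 1, i, i + 1], 0, n - 1)
    conv_weight = lambda q, win: kernel                       # independent of content

    # Self-attention-like instance: global window, content-dependent weights.
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    attn_path   = lambda i, n: np.arange(n)                   # look everywhere
    attn_weight = lambda q, win: softmax(win @ q / np.sqrt(len(q)))  # depends on input

    x = np.random.randn(8, 4)                                 # 8 tokens, 4 channels
    y_conv = gwo_op(x, conv_path, conv_weight)
    y_attn = gwo_op(x, attn_path, attn_weight)

Same skeleton, two different (P, S, W) choices — that's the whole "design space" claim in miniature.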
But the most surprising result came when I analyzed operational complexity. I ran an experiment where different models were forced to memorize a dataset (achieving ~100% training accuracy). The results were clear: complexity used for adaptive regularization (like in Deformable Convolutions, which dynamically change their receptive field) resulted in a dramatically smaller generalization gap than "brute-force" complexity (like in Self-Attention).
This suggests that how an operation uses its complexity is more important than how much it has.
I'm an independent researcher, so getting feedback from a community like this is invaluable. I'd love to hear your thoughts and critiques. Thanks for taking a look.
The paper is here: https://doi.org/10.5281/zenodo.17103133
That's a fantastic question, and you've hit on a perfect example of the GWO framework in action.
The key difference is the level of abstraction: GWO is a general grammar for describing and designing operations, while Mamba is a specific, highly engineered model that can be described by that grammar.
In fact, as I mention in the paper, we can analyze Mamba using the (P, S, W) components:
Path (P): A structured state-space recurrence. This is a very sophisticated path designed to efficiently handle extremely long-range dependencies, unlike a simple sliding window or a dense global matrix.
Shape (S): It's causal and 1D. It processes information sequentially, respecting the nature of time-series or language data.
Weight (W): This is Mamba's superpower. The weights are highly dynamic and input-dependent, controlled by its selective state parameters. This creates an incredibly efficient, content-aware information bottleneck, allowing the model to decide what to remember and what to forget based on the context.
So, Mamba isn't a competitor to the GWO theory; it's a stellar example of it. It's a brilliant instance of "Structural Alignment", where the (P, S, W) configuration is perfectly tailored to the structure of sequential data.
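If it helps to see that mapping in code, here's a deliberately over-simplified toy — not Mamba's actual selective scan (no discretized state matrices, no hardware-aware scan kernel); toy_selective_recurrence and the weight names are just my stand-ins for where each GWO component lives:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def toy_selective_recurrence(x, W_gate, W_in):
        # Toy causal recurrence in GWO terms (NOT Mamba's real selective scan):
        #   Path   = recurrent state carried left-to-right,
        #   Shape  = causal, 1-D (step t only sees t' <= t),
        #   Weight = input-dependent gate deciding what to keep vs. overwrite.
        n, d = x.shape
        h = np.zeros(d)
        out = np.zeros_like(x)
        for t in range(n):
            gate = sigmoid(x[t] @ W_gate)                  # Weight: content decides retention
            h = gate * h + (1.0 - gate) * (x[t] @ W_in)    # Path: state carried forward
            out[t] = h                                     # Shape: strictly causal readout
        return out

    x = np.random.randn(16, 8)          # 16 time steps, 8 channels
    W_gate = np.random.randn(8, 8) * 0.1
    W_in   = np.random.randn(8, 8) * 0.1
    y = toy_selective_recurrence(x, W_gate, W_in)

The real model replaces this crude gate with selective state-space parameters, but the division of labour between Path, Shape, and Weight is the same.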
Thanks for asking this, it's a great point for discussion.