Author here.
I’ve been trying to answer a specific question: Why do "technically superior" architectures (like Neural ODEs, KANs, or pure SSMs) consistently fail to displace the Transformer?
My thesis is that we are measuring the wrong thing. We usually look at FLOPs per token or convergence rates, but in reality hardware imposes a "compute tax" proportional to how far an idea deviates from optimized GPU primitives like dense matrix multiplications (GEMMs).
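You can see a CPU-scale analogue of this tax with plain numpy (a toy sketch of the effect, not the scoring method from the post): run the exact same FLOP budget once as a single dense GEMM and once as thousands of tiny GEMMs, and watch the effective throughput collapse under loop/dispatch overhead.

```python
# Toy illustration of the "compute tax" (numpy/CPU stand-in for the GPU
# effect): identical FLOPs, but split into tiny irregular kernels.
import time
import numpy as np

N = 512
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)

t0 = time.perf_counter()
dense = a @ b                      # one big GEMM: hardware-friendly
t_dense = time.perf_counter() - t0

T = 16                             # tile size for the "deviant" version
tiled = np.zeros((N, N), dtype=np.float32)
t0 = time.perf_counter()
for i in range(0, N, T):
    for j in range(0, N, T):
        for k in range(0, N, T):   # (N // T)**3 tiny 16x16 matmuls
            tiled[i:i+T, j:j+T] += a[i:i+T, k:k+T] @ b[k:k+T, j:j+T]
t_tiny = time.perf_counter() - t0

flops = 2 * N**3                   # identical arithmetic in both cases
print(f"dense GEMM : {flops / t_dense / 1e9:6.1f} GFLOP/s")
print(f"tiny GEMMs : {flops / t_tiny / 1e9:6.1f} GFLOP/s")
assert np.allclose(dense, tiled, rtol=1e-3)
```

Same math, same answer; only the kernel granularity changes, and that granularity is what the friction map is trying to price in.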
I call this the Hardware Friction Map, and I’ve categorized architectures into four zones based on the engineering cost to clear "Gate 1" (viability):
1. Green Zone (Low Friction): Things like RoPE or GQA. They ship in months because they map to existing kernels.
2. Yellow Zone (Kernel Friction): FlashAttention is the canonical example here. Even though the math was settled in 2022, it took 20+ months to become universal because of the "ecosystem tax" (integration into PyTorch, vLLM, etc.).
3. Orange Zone (System Friction): This is where MoEs sit. Everyone talks about DeepSeek V3, but we forget they had to rewrite their cluster scheduler and spend 6 months on infra to make it work. That high friction is a moat for them, but often a death sentence for startups that don't have the runway to debug distributed routing logic.
4. Red Zone (Prohibitive Friction): Architectures like KANs. They rely on tiny, irregular spline evaluations that drop tensor core utilization to ~10%. They are theoretically elegant but economically unshippable.
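To make the Red Zone concrete, here is a toy sketch of mine (not a faithful KAN, and not code from the post) of what a KAN-style layer's inner loop looks like: every edge carries its own 1-D spline, so the work is thousands of small, data-dependent interpolations that don't collapse into one GEMM the way a Linear layer does.

```python
# Toy KAN-style edge evaluation (illustrative only): each of the
# in_dim * out_dim edges applies its own piecewise-linear 1-D function,
# so the inner loop is many tiny irregular lookups, not one dense matmul.
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, grid = 8, 4, 16                  # tiny on purpose
knots = np.linspace(-1.0, 1.0, grid)              # shared knot positions
coef = rng.normal(size=(out_dim, in_dim, grid))   # one "spline" per edge

def kan_layer(x):
    """x: (in_dim,) -> (out_dim,), one interpolation per edge."""
    y = np.zeros(out_dim)
    for o in range(out_dim):
        for i in range(in_dim):                   # per-edge interp, no GEMM
            y[o] += np.interp(x[i], knots, coef[o, i])
    return y

x = rng.uniform(-1, 1, size=in_dim)
out = kan_layer(x)
print(out.shape)
```

Swap the two inner lines for `y = W @ x` and you are back on the GEMM fast path; that delta is the whole Red Zone argument.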
I also did a deep dive into the "Context Trap" for MoEs (throughput dropping ~60% at 32k context due to routing overhead) and why pure SSMs seem to hit a "scalability cliff" at 13B parameters, forcing hybrids like Jamba.
I’ve open-sourced a dataset scoring 100+ architectures on this friction scale (linked in the post). Curious to hear if others are seeing this "friction" kill internal projects.
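For anyone who wants to poke at the dataset before opening it, a record might look roughly like this. To be clear, the field names, zone labels as strings, and the `gemm_fraction` numbers below are my own guesses for illustration, not the actual schema or scores:

```python
# Hypothetical sketch of a friction-dataset record; the real schema in
# the linked dataset may differ. All values here are invented examples.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FrictionScore:
    architecture: str
    zone: str             # "green" | "yellow" | "orange" | "red"
    gemm_fraction: float  # rough share of FLOPs landing in dense GEMMs

scores = [
    FrictionScore("RoPE", "green", 0.99),
    FrictionScore("GQA", "green", 0.97),
    FrictionScore("FlashAttention", "yellow", 0.95),
    FrictionScore("MoE (expert-parallel)", "orange", 0.85),
    FrictionScore("KAN", "red", 0.10),
]

by_zone = defaultdict(list)
for s in scores:
    by_zone[s.zone].append(s.architecture)
for zone in ("green", "yellow", "orange", "red"):
    print(f"{zone:>6}: {', '.join(by_zone[zone])}")
```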
speiroxaiti•2h ago