The Tradeoffs of SSMs and Transformers

https://goombalab.github.io/blog/2025/tradeoffs/

37•jxmorris12•6h ago

Comments

macleginn•4h ago

The part on tokenisation is not very convincing. Replacing BPE with characters or even bytes will not "remove tokenisation" -- atoms will still be tokens, relating to different things in different cultures/writing traditions (a "Chinese byte" is a part of a Chinese character; an "English byte" is basically a letter or a number) and not relating to something fundamentally linguistic. BPE can be thought of as another way of representing linguistic sequences with symbols of some kind; it provides less inductive bias into the use of language, but it is perhaps not categorically different from any kind of writing.
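To make the byte point concrete, here is a minimal sketch, assuming UTF-8 is the byte encoding in question (the comment does not say which one). The helper name `byte_tokens` is made up for illustration. It shows that one English letter is one byte, while one Chinese character is split into three bytes, none of which means anything on its own.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` as a list of integer 'tokens'."""
    return list(text.encode("utf-8"))


english = "a"   # a single Latin letter
chinese = "中"  # a single Chinese character (U+4E2D)

# One letter maps to exactly one byte-level token.
print(english, byte_tokens(english))

# One character maps to three byte-level tokens, each only a fragment of it.
print(chinese, [hex(b) for b in byte_tokens(chinese)])
```

So a byte-level model still sees atoms whose relation to the writing system depends on the script, which is the point being made about tokens not going away.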

Herring•4h ago

I'm a bit bearish on SSMs (and hybrid SSM/transformers) because the leading open weight models (DeepSeek, Qwen, Gemma, Llama) are all transformers. There's just no way none of them tried SSMs.

visarga•4h ago

Yes, until serious adoption I am reserved too, both on SSMs and diffusion based LLMs.

nextos•3h ago

Second-generation LSTMs (xLSTM) do have leading performance on zero-shot time series forecasting: https://arxiv.org/abs/2505.23719.

I think other architectures, aside from the transformer, might lead to SOTA performance, but they remain a bit unexplored.

programjames•2h ago

I mean, everyone is still using variational autoencoders for their latent flow models instead of the information bottleneck. It's because it's cheaper (in founder time) to raise 10(0)x more money instead of having to design your own algorithms and architectures for a novel idea that might work in theory, but could be a dead end six months down the line. Just look at LiquidAI. Brilliant idea, but it took them ~5 years to do all the research and another to get their first models to market... which don't yet seem to be any better than models with a similar compute requirement. I find it pretty plausible that none of the "big" LLM companies seriously tried SSMs, because they already have plenty enough money to throw at transformers, or took a quick path to get a big valuation.

mbowcut2•1h ago

I think I agree with you. My only rebuttal would be that it's this kind of thinking that's kept any leading players from trying other architectures in the first place. As far as I know, SOTA for SSMs just doesn't suggest significant enough potential upsides to warrant significant R&D. Not compared to the tried-and-true established LLM methods. The decision might be something like: "Pay X to train a competitive LLM" vs "Pay 2X to MAYBE train a competitive SSM".
