Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

https://arxiv.org/abs/2506.01963

70•PaulHoule•7mo ago

Comments

zoklet-enjoyer•7mo ago

I don't know what those words mean, but I am excited for the possibilities.

PaulHoule•7mo ago

LLMs can look back over a certain number (N) of tokens, which roughly correspond to words. For instance, if you want to summarize or answer questions about a document accurately, the length of the document has to be less than N.

Conventionally they use an attention mechanism that compares every token to every other token, which has a cost of N*N, or N squared, which is quadratic. If you want LLMs to chew over a huge amount of context (all the source code for your project), that's a problem, so people are looking for ways around this.
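
To make the N squared growth concrete, here is a minimal Python sketch that counts token-to-token comparisons under full self-attention. The function name `attention_comparisons` is made up for illustration, and the count charges one unit per token pair while ignoring everything else a real model does.

```python
def attention_comparisons(n_tokens: int) -> int:
    """Count pairwise comparisons when every token is compared to every token."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    # Each 10x increase in context length multiplies the count by 100.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens -> {attention_comparisons(n):,} comparisons")
```

Under this simplified count, doubling the context quadruples the work, which is why very long contexts get expensive quickly.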

zoklet-enjoyer•7mo ago

Thank you for that explanation

rybosome•7mo ago

To add to that excellent high-level explanation of what the attention mechanism is (from my reading of the abstract of this paper):

This work builds a model that has the ability to “remember” parts of its previous input when generating and processing new input, and has part of its intelligence devoted to determining what is relevant to remember.

This is in lieu of kind of saying “I need to keep re-reading what I’ve already read and said to keep going”.

I’d welcome better explanations. :)

Icko_•7mo ago

Not even that. With KV-caching, it's linear with the size of the context; and if someone figured out a way to have e.g. NlogN complexity, I imagine with KV-caching it may go down to logN complexity. (If the new algorithm permits that.)

yorwba•7mo ago

When people say that attention is quadratic, they mean that the cost to process n tokens is O(n²), so the amortized cost per token is indeed O(n). KV-caching is a way to maintain that amortized cost when appending tokens one at a time instead of ingesting the whole sequence at once. But in the end people want to be able to generate multiple tokens, so we're back at O(n²) total time again.

IIRC there are some FFT-based attention alternatives where encoding has complexity O(n log n), but there's no feasible way to cache anything and after appending a single token it costs O(n log n) again, so if you generate n tokens in sequence, the cost is actually O(n² log n).
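
The three cost regimes described above can be compared with a rough Python sketch. It is a unit-cost model only: the function names are hypothetical, constants are dropped, and "FFT-style" stands for any encoder that costs about i log i on a prefix of length i and cannot reuse earlier work.

```python
import math


def total_cost_with_kv_cache(n: int) -> int:
    """Generating token i attends over i cached keys: 1 + 2 + ... + n, i.e. O(n^2)."""
    return sum(range(1, n + 1))


def total_cost_without_cache(n: int) -> int:
    """Full attention is recomputed on the whole prefix at every step: O(n^3) total."""
    return sum(i * i for i in range(1, n + 1))


def total_cost_fft_style(n: int) -> float:
    """An O(i log i) re-encoding of the prefix at every step: O(n^2 log n) total."""
    return sum(i * math.log2(i) for i in range(1, n + 1))


if __name__ == "__main__":
    n = 1024
    print("with KV cache :", total_cost_with_kv_cache(n))
    print("FFT-style     :", round(total_cost_fft_style(n)))
    print("no cache      :", total_cost_without_cache(n))
```

In this model the KV-cached total stays quadratic, the uncached FFT-style total picks up the extra log factor, and recomputing full attention at every step is worst of the three.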

imranq•7mo ago

I like the idea of removing quadratic scaling for attention, but this paper has thin experimental support. No real tasks tested beyond perplexity. Nothing on reasoning, retrieval QA, or summarization quality. Even in perplexity the gains are marginal.

However, it removes attention, so I think it's worth watching that space of non-attention models.

yorwba•7mo ago

This paper seems rather unfocused, explaining their architecture three times with slight variations while managing to omit crucial details like how exactly they compute gradients for their "External Retrieval Memory."

Also, the section on DeepSeek is really weird: "While the precise architectural details of DeepSeek LLM are still emerging, early discussions suggest that it relies on an extended Transformer backbone or a "hybrid" approach that likely incorporates some form of attention-based mechanism, potentially at specific layers or across chunk boundaries, to facilitate information flow across large contexts." It makes it sound like a mystery, even though there have been multiple papers published on it (they cite the R1 one) so that there's really no need to guess whether attention is involved.

Overall I'm not convinced the authors know what they're doing.

roxolotl•7mo ago

Would you say they aren’t paying attention?

cubefox•7mo ago

I think it's fair to say they are explicitly avoiding attention.

NitpickLawyer•7mo ago

Hate to be that guy, but this screams LLM-generated to me. Between the titles, the vague explanations, the vague concepts, and the overall amount of fluff to data, I'd bet good money that this was generated with an LLM.

It's not inherently bad to use an LLM for consistency, language and overall sprucing up, but this is taking it a bit too far. It seems like they've prompted it to explain some notes, but it's unclear how well it did, since the notes themselves (i.e. data, experiments, etc.) are missing. And it seems poorly prompted in that it consists of lots of fluff paragraphs, devoid of core knowledge, going round and round explaining the same concepts with different words.

In the end the responsibility for the end product is always on the submitter. This whole paper could have been a prompt, and it's worrying that this is accepted at such a prestigious school.

albertzeyer•7mo ago

"hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Also note, if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of the self-attention is really not such a big issue - the matrix multiplication in the feed-forward layers will be usually 8x the model dimension squared, and thus that part will usually dominate.
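
A back-of-the-envelope version of this comparison, as a Python sketch under stated assumptions: one layer, a feed-forward block with 4x expansion (two matrices of size d x 4d, hence 8 d^2 multiply-adds per token), and attention counted only as the score and value products (2 n d multiply-adds per token). The Q/K/V/output projections, softmax, and head structure are ignored, and the function names are hypothetical.

```python
def ffn_macs_per_token(d_model: int) -> int:
    """Multiply-adds per token for a 4x-expansion feed-forward block: 8 * d^2."""
    return 8 * d_model * d_model


def attention_macs_per_token(n_tokens: int, d_model: int) -> int:
    """Multiply-adds per token for QK^T scores plus the weighted sum over V: 2 * n * d."""
    return 2 * n_tokens * d_model


if __name__ == "__main__":
    d = 4096
    for n in (4_096, 16_384, 409_600):
        ratio = attention_macs_per_token(n, d) / ffn_macs_per_token(d)
        print(f"n={n:>7}, d={d}: attention / feed-forward = {ratio:.2f}")
```

In this simplified count the two terms break even at n = 4d, and attention only dwarfs the feed-forward cost once n is far larger than d.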

Also note that there has been so much research on this already. While this particular approach might be novel, there have been attempts to avoid the O(n^2) complexity in self-attention almost since the original transformer paper came out in 2017. I'm a bit surprised that this paper does not cite xLSTM or Block-Recurrent Transformers.

Also, this paper falls very short on experiments. There is basically only table 2. There is no study on length extrapolation (which is very relevant for the topic), no needle-in-haystack experiments, no scaling studies, no larger-scale experiments, etc. Also, even in this main table 2, I see a couple of typos. And looking at the results in table 2, the improvements seem to be quite minor.

So I would conclude, this needs a lot more work.

cubefox•7mo ago

> "hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Yes, but those are all relying on proprietary company secrets, while this is an open research paper. Besides, only Gemini so far has a context window of more than a million tokens.

littlestymaar•7mo ago

Llama 4 Scout has it too, and it is an open-weight LLM; unfortunately, it is also disappointing at pretty much any context length…

3abiton•7mo ago

> Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely.

I skimmed the paper, and unlike transformers they basically can scale much more efficiently with longer context. While it's possible to fit 1M tokens, you need a significant amount of memory. They benchmark against GPT-2, though, so I would call this quite preliminary work so far, albeit with a promising architecture.

boroboro4•7mo ago

> Also note, if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of the self-attention is really not such a big issue - the matrix multiplication in the feed-forward layers will be usually 8x the model dimension squared, and thus that part will usually dominate.

This is incorrect in the case of batched inference. There are two bottlenecks at play, compute and memory, and your reasoning applies to compute. In the case of memory it gets trickier: for MLP layers you'll need to read the same set of weights for all elements of your batch, while the KV cache for attention is different for each element. That's why in practice the real length where attention dominates would be closer to model dimension / batch size, rather than just model dimension. And this number isn't as high anymore.
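
A rough Python sketch of the memory-traffic argument, for one layer and one decoding step. The assumptions are mine, not the commenter's: 2 bytes per value, a 4x-expansion MLP (8 d^2 weights, read once per batch), and a full multi-head KV cache (keys and values of width d for every cached token, read separately for each sequence, with no grouped-query sharing). The function names are hypothetical.

```python
def mlp_weight_bytes(d_model: int, bytes_per_value: int = 2) -> int:
    """Bytes of MLP weights read per step; shared by the whole batch."""
    return 8 * d_model * d_model * bytes_per_value


def kv_cache_bytes(batch: int, n_tokens: int, d_model: int, bytes_per_value: int = 2) -> int:
    """Bytes of K and V read per step; every sequence in the batch has its own cache."""
    return batch * n_tokens * 2 * d_model * bytes_per_value


if __name__ == "__main__":
    d = 4096
    for batch in (1, 8, 32):
        # Context length at which KV-cache reads match MLP weight reads.
        break_even = 4 * d // batch
        print(f"batch={batch:>2}: KV reads match MLP reads at about {break_even} tokens")
```

Under these assumptions the break-even context length is 4d / batch, so it shrinks in proportion to batch size, in line with the "model dimension / batch size" estimate.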

daxfohl•7mo ago

Partially related, is charging by token sustainable for LLM shops? If the compute requirements go up quadratically, doesn't that mean cost should as well?

sakras•7mo ago

Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin and a 50k bin and a 500k bin, and then you drop context past 500k. So the costs are fixed per-bin.
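
The binning scheme can be sketched in a few lines of Python. The bin sizes come from the comment above; the function name `assign_bin` and the choice to round each request up to the smallest bin that fits are assumptions for illustration, not a description of any particular provider.

```python
from bisect import bisect_left

# Bin sizes taken from the example above; real deployments will differ.
CONTEXT_BINS = [10_000, 50_000, 500_000]


def assign_bin(n_tokens: int, bins: list[int] = CONTEXT_BINS) -> int:
    """Return the smallest bin that fits the request, or the largest bin if none does.

    Requests longer than the largest bin land there and would have their
    context truncated to that size.
    """
    index = bisect_left(bins, n_tokens)
    return bins[min(index, len(bins) - 1)]


if __name__ == "__main__":
    for length in (7_500, 10_001, 120_000, 600_000):
        print(f"{length:>7} tokens -> {assign_bin(length):,} bin")
```

Because every request in a bin is processed as if it had the bin's full length, the serving cost is roughly fixed per bin rather than per request.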

daxfohl•7mo ago

Makes sense, and each model has a max context length, so they could charge per token assuming full context by model if they wanted to assume worst case.

maxrmk•7mo ago

> While the specific internal workings of DeepSeek LLM are still being elucidated, it appears to maintain or approximate the self-attention paradigm to some extent.

Totally nonsensical. DeepSeek's architecture is well documented, and multiple implementations are available online.

gsf_emergency•7mo ago

https://github.com/andrew-jeremy/nonAttentionLLM

ljlolel•7mo ago

This needs way more charts and graphs

juank10•7mo ago

Funnily enough, the code was deleted in the repo, but can still be seen in the commits. It's what you would expect from the paper :D

On the general topic of non-attention LLMs, I recommend checking out the MesaNet [1], Rodimus [2], Gated DeltaNet [3], or Mamba2 [4]. They are currently SOTA.

However, I have yet to see a compelling non attention based model that achieves good performance on code, math, reasoning, or multi-turn QA tasks. I do not think we are getting rid of attention soon, I believe the ability to look back is crucial in certain tasks.

[1] https://arxiv.org/abs/2506.05233
[2] https://arxiv.org/abs/2410.06577
[3] https://arxiv.org/abs/2412.06464
[4] https://arxiv.org/abs/2405.21060
