Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

https://arxiv.org/abs/2506.01963
44•PaulHoule•7h ago

Comments

zoklet-enjoyer•6h ago
I don't know what those words mean, but I am excited for the possibilities.
PaulHoule•5h ago
LLMs can look back over a certain number (N) of tokens, which roughly correspond to words. For instance, if you want to summarize or answer questions about a document accurately, the length of the document has to be less than N.

Conventionally they use an attention mechanism that compares every token to every other token, which has a cost of N*N, i.e. N squared, which is quadratic. If you want LLMs to chew over a huge amount of context (say, all the source code for your project), that's a problem, so people are looking for ways around it.
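
To make the N*N part concrete, here is a minimal NumPy sketch of vanilla single-head attention (purely illustrative, not code from the paper); the scores matrix is the thing that grows quadratically with context length:

    import numpy as np

    N, d = 1024, 64                    # sequence length, head dimension
    Q = np.random.randn(N, d)          # one query per token
    K = np.random.randn(N, d)          # one key per token
    V = np.random.randn(N, d)          # one value per token

    scores = Q @ K.T                   # N x N: every token compared to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    out = weights @ V                  # N x d output

    # Both the memory and the compute for the scores matrix scale as N*N,
    # which is the quadratic cost this paper is trying to avoid.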

zoklet-enjoyer•5h ago
Thank you for that explanation
rybosome•5h ago
Adding to that excellent high-level explanation of what the attention mechanism is, I'd add (from my reading of the paper's abstract):

This work builds a model that can "remember" parts of its previous input when generating and processing new input, and devotes part of its capacity to determining what is relevant to remember.

This is in lieu of, in effect, "I need to keep re-reading what I've already read and said in order to keep going".
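
Purely as an illustration of that idea (my own toy sketch, not the paper's actual architecture), a fixed-size external memory with a learned "what's worth remembering" gate could look roughly like this:

    import numpy as np

    d, slots = 64, 16
    rng = np.random.default_rng(0)
    W_gate = rng.standard_normal((d, 1)) * 0.1       # learned in a real model
    W_write = rng.standard_normal((d, slots)) * 0.1  # learned in a real model

    memory = np.zeros((slots, d))   # fixed-size state carried from chunk to chunk

    def update_memory(memory, chunk):                # chunk: (chunk_len, d)
        # score how "worth remembering" each token is (the learned-relevance part)
        relevance = 1 / (1 + np.exp(-(chunk @ W_gate)))      # (chunk_len, 1)
        # compress the relevant tokens into the fixed number of memory slots
        write = (chunk @ W_write).T @ (relevance * chunk)    # (slots, d)
        return 0.95 * memory + 0.05 * write          # decay old content, blend in new

    for chunk in rng.standard_normal((8, 128, d)):   # process a long input chunk by chunk
        memory = update_memory(memory, chunk)

Because the memory stays the same size no matter how many chunks go by, the cost grows linearly with input length instead of quadratically.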

I’d welcome better explanations. :)

Icko_•5h ago
Not even that. With KV caching, it's linear in the size of the context; and if someone figured out a way to get, e.g., N log N complexity, I imagine with KV caching it might go down to log N complexity (if the new algorithm permits that).
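
A rough sketch of the per-token picture with a KV cache (my own illustration, names made up):

    import numpy as np

    d = 64
    k_cache, v_cache = [], []          # grows by one entry per generated token

    def decode_step(q, k, v):
        k_cache.append(k)
        v_cache.append(v)
        K = np.stack(k_cache)          # (t, d)
        V = np.stack(v_cache)          # (t, d)
        scores = K @ q                 # t dot products: linear in current context length t
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                   # (d,)

    # Each decode step is O(t * d). Over a whole N-token generation that still
    # sums to O(N^2 * d) in total, but the cost per new token is linear in context.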
imranq•5h ago
I like the idea of removing quadratic scaling for attention, but this paper has thin experimental support. No real tasks are tested beyond perplexity: nothing on reasoning, retrieval QA, or summarization quality. Even on perplexity the gains are marginal.

However, it does remove attention, so I think it's worth watching the space of non-attention models.

yorwba•5h ago
This paper seems rather unfocused, explaining their architecture three times with slight variations while managing to omit crucial details like how exactly they compute gradients for their "External Retrieval Memory."

Also, the section on DeepSeek is really weird: "While the precise architectural details of DeepSeek LLM are still emerging, early discussions suggest that it relies on an extended Transformer backbone or a "hybrid" approach that likely incorporates some form of attention-based mechanism, potentially at specific layers or across chunk boundaries, to facilitate information flow across large contexts." It makes it sound like a mystery, even though there have been multiple papers published on it (they cite the R1 one) so that there's really no need to guess whether attention is involved.

Overall I'm not convinced the authors know what they're doing.

roxolotl•5h ago
Would you say they aren’t paying attention?
cubefox•4h ago
I think it's fair to say they are explicitly avoiding attention.
albertzeyer•5h ago
"hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Also note that if the sequence length is not really much larger than the model dimension (at least two orders of magnitude larger), the quadratic complexity of self-attention is not such a big issue: the matrix multiplication in the feed-forward layers is usually 8x the model dimension squared, and that part will usually dominate.
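
Rough per-token, per-layer arithmetic behind that (my own back-of-the-envelope, assuming the standard 4x FFN expansion):

    d = 4096                     # model dimension
    N = 32768                    # sequence length

    ffn = 8 * d * d              # two matmuls, d -> 4d -> d
    attn_quad = 2 * N * d        # the quadratic part: scores (QK^T) plus weighted sum over V

    print(ffn, attn_quad, attn_quad / ffn)   # 134217728 268435456 2.0
    # The N^2 term only overtakes the FFN once N is several times the model dimension.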

Also note that there has been so much research on this already. While this particular approach might be novel, there have been attempts to avoid the O(n^2) complexity of self-attention basically ever since the original Transformer paper came out in 2017. I'm a bit surprised that this paper doesn't cite xLSTM or Block-Recurrent Transformers.

Also, this paper falls very short on experiments. There is basically only Table 2. There is no study of length extrapolation (which is very relevant for the topic), no needle-in-a-haystack experiments, no scaling studies, no larger-scale experiments, etc. Even in the main Table 2 I see a couple of typos, and looking at the results there, the improvements seem quite minor.

So I would conclude, this needs a lot more work.

cubefox•4h ago
> "hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Yes, but those all rely on proprietary company secrets, while this is an open research paper. Besides, only Gemini so far has a context window of more than a million tokens.

littlestymaar•4h ago
Llama 4 Scout has it too, and it's an open-weight LLM; unfortunately, it's also disappointing at pretty much any context length…
3abiton•4h ago
> Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely.

I skimmed the paper, and unlike transformers they can basically scale much more efficiently with longer context. While it's possible to fit 1M tokens in a transformer, you need a significant amount of memory. Although they only benchmark against GPT-2, so I would say it's quite preliminary work so far, though a promising architecture.

boroboro4•4h ago
> Also note, if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of the self-attention is really not such a big issue - the matrix multiplication in the feed-forward layers will be usually 8x the model dimension squared, and thus that part will usually dominate.

This is incorrect in the case of batched inference. There are two bottlenecks at play, compute and memory, and your reasoning applies to compute. In the case of memory it gets trickier: for the MLP layers you need to read the same set of weights for all elements of your batch, while the KV-cache entries for attention differ per element. That's why in practice the real length where attention starts to dominate is closer to model dimension / batch size, rather than just model dimension. And that number isn't as high anymore.
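
Back-of-the-envelope version of the memory side (my numbers, fp16, ignoring attention projections and GQA, which shrinks the KV side in practice):

    d = 4096                        # model dimension
    B = 64                          # decode batch size
    N = 8192                        # context length per sequence

    # bytes read per layer per decode step
    weight_bytes = 8 * d * d * 2    # FFN weights, read once for the whole batch
    kv_bytes = 2 * N * d * 2 * B    # K and V cache, read separately for every sequence

    print(weight_bytes, kv_bytes)   # 268435456 vs 8589934592
    # They cross over around N = 4 * d / B (here ~256 tokens), so with large batches
    # the KV-cache reads dominate long before N reaches the model dimension.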

daxfohl•4h ago
Partially related: is charging by token sustainable for LLM shops? If the compute requirements go up quadratically, doesn't that mean cost should as well?
sakras•4h ago
Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin, a 50k bin, and a 500k bin, and then you drop context past 500k. So the costs are fixed per bin.
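
A toy sketch of that kind of binning (bin sizes are made up):

    BINS = [10_000, 50_000, 500_000]        # hypothetical context-length bins

    def assign_bin(num_tokens):
        # return the smallest bin that fits the request, or None if it's too long
        for bin_size in BINS:
            if num_tokens <= bin_size:
                return bin_size
        return None                          # past the largest bin: context gets dropped

    print(assign_bin(12_345))                # -> 50000

    # Requests in the same bin get padded and batched together, so the serving cost
    # per request is roughly fixed per bin rather than per exact context length.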
daxfohl•1h ago
Makes sense. And since each model has a max context length, they could charge per token assuming the full context for that model if they wanted to price for the worst case.
maxrmk•4h ago
> While the specific internal workings of DeepSeek LLM are still being elucidated, it appears to maintain or approximate the self-attention paradigm to some extent.

Totally nonsensical. DeepSeek's architecture is well documented; multiple implementations are available online.

gsf_emergency•1h ago
https://github.com/andrew-jeremy/nonAttentionLLM

DRM Can Watch You Too: Privacy Effects of Browsers' Widevine EME (2023)

https://hal.science/hal-04179324v1/document
74•exceptione•4h ago•38 comments

Snorting the AGI with Claude Code

https://kadekillary.work/blog/#2025-06-16-snorting-the-agi-with-claude-code
214•beigebrucewayne•15h ago•121 comments

What Happens When Clergy Take Psilocybin

https://nautil.us/clergy-blown-away-by-psilocybin-1217112/
70•bookofjoe•5h ago•63 comments

Show HN: Canine – A Heroku alternative built on Kubernetes

https://github.com/czhu12/canine
158•czhu12•8h ago•76 comments

Show HN: Chawan TUI web browser

https://chawan.net/news/chawan-0-2-0.html
175•shiomiru•5h ago•23 comments

Benzene at 200

https://www.chemistryworld.com/opinion/benzene-at-200/4021504.article
180•Brajeshwar•11h ago•92 comments

Battle to eradicate invasive pythons in Florida achieves milestone

https://phys.org/news/2025-06-eradicate-invasive-pythons-florida-stunning.html
20•wglb•4h ago•15 comments

Show HN: Nexus.js - Fabric.js for 3D

https://punk.cam/lab/nexus
43•ges•6h ago•18 comments

ZX Spectrum Graphics Magic: The Basics Every Spectrum Fan Should Know

https://zxonline.net/zx-spectrum-graphics-magic-the-basics-every-spectrum-fan-should-know/
3•ibobev•1d ago•0 comments

Retrobootstrapping Rust for some reason

https://graydon2.dreamwidth.org/317484.html
101•romac•6h ago•34 comments

Dull Men’s Club

https://www.theguardian.com/society/2025/jun/09/meet-the-members-of-the-dull-mens-club-some-of-them-would-bore-the-ears-off-you
77•herbertl•8h ago•42 comments

Open-Source RISC-V: Energy Efficiency of Superscalar, Out-of-Order Execution

https://arxiv.org/abs/2505.24363
63•PaulHoule•9h ago•15 comments

OpenAI wins $200M U.S. defense contract

https://www.cnbc.com/2025/06/16/openai-wins-200-million-us-defense-contract.html
83•erikrit•4h ago•50 comments

Blaze (YC S24) Is Hiring

https://www.ycombinator.com/companies/blaze-2/jobs/dzNmNuw-junior-software-engineer
1•faiyamrahman•5h ago

What I talk about when I talk about IRs

https://bernsteinbear.com/blog/irs/
7•surprisetalk•3d ago•1 comments

OpenTelemetry for Go: Measuring overhead costs

https://coroot.com/blog/opentelemetry-for-go-measuring-the-overhead/
98•openWrangler•11h ago•34 comments

Show HN: Zeekstd – Rust Implementation of the ZSTD Seekable Format

https://github.com/rorosen/zeekstd
177•rorosen•1d ago•40 comments

Working on databases from prison

https://turso.tech/blog/working-on-databases-from-prison
709•dvektor•14h ago•454 comments

Nanonets-OCR-s – OCR model that transforms documents into structured markdown

https://huggingface.co/nanonets/Nanonets-OCR-s
287•PixelPanda•20h ago•66 comments

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

https://arxiv.org/abs/2506.01963
44•PaulHoule•7h ago•19 comments

Ask HN: How to Deal with a Bad Manager?

23•finik_throwaway•1h ago•22 comments

Is gravity just entropy rising? Long-shot idea gets another look

https://www.quantamagazine.org/is-gravity-just-entropy-rising-long-shot-idea-gets-another-look-20250613/
265•pseudolus•1d ago•228 comments

Adding public transport data to Transitous

https://www.volkerkrause.eu/2025/06/14/transitous-adding-data.html
49•todsacerdoti•2d ago•0 comments

Show HN: dk – A script runner and cross-compiler, written in OCaml

https://diskuv.com/dk/help/latest/
53•beckford•11h ago•7 comments

Identity Assertion Authorization Grant

https://www.ietf.org/archive/id/draft-parecki-oauth-identity-assertion-authz-grant-03.html
7•mooreds•4d ago•3 comments

ZjsComponent: A Pragmatic Approach to Reusable UI Fragments for Web Development

https://arxiv.org/abs/2506.11016
64•lelanthran•11h ago•44 comments

Finland warms up the world's largest sand battery, the economics look appealing

https://techcrunch.com/2025/06/16/finland-warms-up-the-worlds-largest-sand-battery-and-the-economics-look-appealing/
9•pseudolus•33m ago•0 comments

WhatsApp introduces ads in its app

https://www.nytimes.com/2025/06/16/technology/whatsapp-ads.html
238•greenburger•12h ago•327 comments

Transparent peer review to be extended to all of Nature's research papers

https://www.nature.com/articles/d41586-025-01880-9
108•rntn•7h ago•58 comments

Occurrences of swearing in the Linux kernel source code over time

https://www.vidarholen.net/contents/wordcount/#fuck*,shit*,damn*,idiot*,retard*,crap*
151•microsoftedging•2d ago•223 comments