Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

https://arxiv.org/abs/2602.00294
60•fheinsen•1h ago

Comments

spacewhales•1h ago
Github here: https://github.com/glassroom/sata_attention
yanosh_kunsh•1h ago
So does that mean that LLM inference could go down significantly in price and/or context length would dramatically increase?
bluecoconut•59m ago
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the Taylor expression would fail to represent the values well. This just... softens attention's ability to attend?

(I'm imagining that if in the context there's ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this Taylor approximation version would wash out any difference and they'd all loosely be attended to, and it'd fail to isolate the correct signal)

Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.

But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.
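
To make the worry concrete, here is a toy sketch in plain Python. It is not the paper's method: it simply swaps exp for a truncated Taylor series inside a softmax-style normalization. The function names, the `terms` parameter, and the two score vectors are made up for illustration, and the scores are kept non-negative because an odd-degree truncation can go negative for large negative inputs.

```python
import math

def taylor_exp(x, terms=4):
    """Truncated Taylor series of exp(x) around 0, keeping `terms` terms."""
    return sum(x ** n / math.factorial(n) for n in range(terms))

def normalize(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def softmax(scores):
    """Ordinary softmax over a list of scores."""
    return normalize([math.exp(s) for s in scores])

def taylor_softmax(scores, terms=4):
    """Softmax-style weights with exp replaced by its truncated series."""
    return normalize([taylor_exp(s, terms) for s in scores])

# One mildly preferred target versus one strongly preferred target.
flat_scores = [0.5] + [0.0] * 7
sharp_scores = [8.0] + [0.0] * 7

for name, scores in (("flat", flat_scores), ("sharp", sharp_scores)):
    exact = softmax(scores)[0]
    approx = taylor_softmax(scores)[0]
    print(f"{name}: softmax top weight {exact:.4f}, truncated top weight {approx:.4f}")
```

In this toy setting the two versions agree closely when the scores are small, and the truncated version puts noticeably less weight on the winner when one score is large. Whether real attention logits ever reach that regime after scaling, and how the paper's construction handles it, is not something this sketch answers.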

energy123•53m ago
> this is where the Taylor expression would fail to represent the values well.

"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"

seanhunter•31m ago
I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
mapontosevenths•53m ago
> This just... softens attention's ability to attend?

I think this does soften, but not linearly. That is to say the fixed state size limitation means that it softens more as it gets further into the past.

tehsauce•36m ago
Right, and when they compare to floating point accuracy they seem to be using the number of decimals supported by the mantissa, but the exponent is important, no?
seanhunter•22m ago
When someone says the error is of a certain magnitude they mean the absolute value of the difference between the two things, so what they're saying is that the values they produced with their approximation are about as wrong as the difference between the actual values and those values cast to float16. The exponent is most definitely important and would be included in that.
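
A quick way to see what "the difference between the actual values and those values cast to float16" looks like is to round-trip a few numbers through half precision. This is only a sketch using Python's standard `struct` module (format code `e` is IEEE 754 binary16); the helper names and sample values are made up for illustration.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def float16_abs_error(x):
    """Absolute error introduced by casting x to float16."""
    return abs(x - to_float16(x))

# The absolute rounding error depends on the exponent of the value,
# because the spacing between representable numbers grows with magnitude.
for value in (0.1, 1.0, 1000.1):
    print(value, to_float16(value), float16_abs_error(value))
```

The absolute error for a value near 1000 is far larger than for a value near 0.1, which is the sense in which the exponent is already part of any "float16 resolution" comparison.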
mapontosevenths•55m ago
This uses the Taylor approximation to approximate softmax, but that IS only an approximation. I wonder exactly how much that trade-off costs in terms of accuracy vs performance? I note that they say it's close to float16 with four Taylor terms.

My other concern would be that Taylor itself is fairly complex. I wonder how well GPUs handle this in comparison to good old-fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.

rvz•50m ago
> Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.

Now this is a very interesting paper, which should hopefully address AI's chronic lack of efficient methods for reducing its computational and energy demands, which are off the charts.

> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.

It's still early, with several limitations, but wasting billions on GPUs will soon stop making any sense.

thomasahle•47m ago
There's a graveyard of 100s of papers with "approximate near linear time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214

cubefox•37m ago
I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.
andy12_•7m ago
It really isn't sub N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that still has complexity O(N^2). So overall it still has the same complexity; just with a smaller constant factor [1]

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556

fheinsen•21m ago
When linear approximation error is of similar magnitude to quadratic numerical error, don’t the two become comparable? In practice, attention is computed with low-precision (4-bit, 8-bit, 16-bit) floats. Numerical error, in fact, may be a plausible explanation as to why quadratic attention, in practice, exhibits "context rot" as context gets longer, very much like an RNN.
cobolexpert•15m ago
Dumb question: is the quadratic time complexity for training, inference, or both?
omneity•13m ago
Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).
kristjansson•5m ago
> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.

naasking•1m ago
I think any kind of innovation here will have to take advantage of some structure inherent to the problem, like eliminating attention in favour of Grassmann flows [1].

[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428

observationist•42m ago
This could turbocharge ByT5 and other tokenless architectures, whose big downside was the increase in compute over longer sequences. It's easy to imagine a bunch of strategies with variable levels of "focus" and so on with a fixed compute budget assigned on the fly with learned optimizers informing the distribution.
andes314•24m ago
Linear-time attention doesn’t work, on principle. It's a dead-end pursuit. There is much great research on more efficient quadratic-time inference.
abeppu•21m ago
I haven't tried to follow the math closely but should there not be some concern about the region of convergence? It looks like they don't specifically discuss it. Or is there some reason this isn't a problem in this context?
reactordev•20m ago
I fear they have completely overlooked it.
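
For what it's worth, the Taylor series of exp around 0 converges for every real input, so the practical question is less about a region of convergence and more about how many terms a given input size needs. A small sketch of that, with a made-up helper name and an arbitrary tolerance; it says nothing about how the paper bounds its inputs:

```python
import math

def terms_needed(x, tol=1e-3, max_terms=200):
    """Count the Taylor terms of exp(x) needed for absolute error <= tol.

    Returns None if max_terms is not enough.
    """
    target = math.exp(x)
    total = 0.0
    term = 1.0  # the n = 0 term, x**0 / 0!
    for n in range(max_terms):
        total += term
        if abs(target - total) <= tol:
            return n + 1
        term *= x / (n + 1)  # next term: x**(n+1) / (n+1)!
    return None

for x in (0.5, 1.0, 4.0, 8.0):
    print(x, terms_needed(x))
```

The count grows with the size of the input, so a fixed number of terms is only accurate if the values being exponentiated stay small.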
alyxya•10m ago
The best and proven linear attention is the Gated DeltaNet or variations of it, used by Kimi and Qwen. Anyone who thinks linear attention can't work is forgetting that models are a fixed size so attention should always be compressible to be linear. Another way to think of the feasibility of linear attention is that the standard attention mechanism can be made linear simply by removing the softmax so the kv cache stores the kv product as a constant size matrix instead. Softmax just normalizes attention, but it's not theoretically required.
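
The "remove the softmax" point can be sketched in a few lines of plain Python. This is a minimal, unnormalized, causal toy, not Gated DeltaNet and not the paper's method; all function names and the sample vectors are made up for illustration. The running state is a d_k x d_v matrix whose size does not depend on how many tokens have been seen.

```python
def linear_attention(qs, ks, vs):
    """Causal attention without softmax, using a constant-size running state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of outer(k, v)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Fold the new key/value pair into the state: state += outer(k, v).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the query: output = q^T state.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

def quadratic_attention(qs, ks, vs):
    """Same quantity computed the usual way, revisiting every past token."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for k, v in zip(ks[:t + 1], vs[:t + 1]):
            score = sum(qi * ki for qi, ki in zip(q, k))  # raw q.k, no softmax
            for j, vj in enumerate(v):
                out[j] += score * vj
        outputs.append(out)
    return outputs

qs = [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]]
ks = [[0.2, 1.0], [1.0, -1.0], [0.0, 0.5]]
vs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]

print(linear_attention(qs, ks, vs))
print(quadratic_attention(qs, ks, vs))
```

Both functions produce the same outputs; the first does constant work per token, while the second does work proportional to the number of tokens seen so far.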
