frontpage.

How Attention Sinks Keep Language Models Stable

https://hanlab.mit.edu/blog/streamingllm
69•pr337h4m•6h ago

Comments

Havoc•4h ago
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."

I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.
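For context, the mechanism in the article doesn't seem to depend on what those first tokens actually say: StreamingLLM just pins the first few positions' KV entries and slides a window over everything else. A rough sketch of the cache policy (not the authors' code; num_sinks and window are made-up names):

    # StreamingLLM-style eviction sketch: keep the first few "sink" tokens
    # forever, plus a sliding window of the most recent tokens.
    def evict(keys, values, num_sinks=4, window=1020):
        if len(keys) <= num_sinks + window:
            return keys, values
        keys = keys[:num_sinks] + keys[-window:]
        values = values[:num_sinks] + values[-window:]
        return keys, values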

optimalsolver•1h ago
"Magnets. How do they work?"
gjm11•29m ago
The heuristic doesn't work quite so well when applied to the actual original version of that line.
xg15•36m ago
Some people start their prompts with "Hello" or "Please" or something similar, out of some habitual sense of politeness, I think. It would be hilarious if those prompts really work better because the model can use those words as attention sinks.
Calavar•4h ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.

This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].

[1] https://arxiv.org/abs/1712.02950
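The BERT effect is easy to eyeball, too. A quick sketch with HuggingFace transformers (model name and sentence are arbitrary) that averages how much attention lands on [CLS] and [SEP]:

    # Sketch: measure attention mass on the first and last tokens in BERT.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one (batch, heads, query, key) tensor per layer.
    for i, attn in enumerate(out.attentions):
        cls_mass = attn[0, :, :, 0].mean().item()   # attention paid to [CLS]
        sep_mass = attn[0, :, :, -1].mean().item()  # attention paid to [SEP]
        print(f"layer {i:2d}: mean attention to [CLS]={cls_mass:.2f}, [SEP]={sep_mass:.2f}")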

am17an•3h ago
This is nice and useful because the new GPT-OSS model uses this technique. Kudos to the original authors!
diggan•1h ago
And, as always, the FOSS ecosystem moves quickly: llama.cpp already fully supports them! https://github.com/ggml-org/llama.cpp/pull/15157
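From what I can tell from the implementations, the sink there isn't a real token at all, just an extra learned logit per head that soaks up probability mass in the softmax while contributing no value. A rough sketch of that variant (my reading, not the actual gpt-oss or llama.cpp code):

    # Attention with a learned per-head sink logit (illustrative names only).
    import torch
    import torch.nn.functional as F

    def attn_with_sink(q, k, v, sink_logit):
        # q, k, v: (heads, seq, dim); sink_logit: (heads,) learned parameter
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5       # (heads, seq, seq)
        sink = sink_logit[:, None, None].expand(-1, scores.shape[1], 1)
        weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        weights = weights[..., :-1]          # drop the sink column: it absorbs
        return weights @ v                   # probability mass but adds no value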
esafak•2h ago
> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.

This sounds like it is working for the wrong reasons. Surely the right behavior is for the right tokens to receive attention rather than the first handful; jamming everything there is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior, or something similar, to prevent blurring.
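Something like an entropy penalty on attention rows, say (just a sketch of the kind of regularizer I mean; untested, and not from the article):

    # Penalize diffuse (high-entropy) attention rows during training,
    # as a stand-in for "equalization with a sparsity prior".
    import torch

    def attention_entropy_penalty(attn, eps=1e-9):
        # attn: (batch, heads, query, key), rows sum to 1
        entropy = -(attn * (attn + eps).log()).sum(dim=-1)
        return entropy.mean()   # add lambda * this to the training loss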

yorwba•2h ago
The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
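A toy illustration of the failure mode: softmax hands out 100% of the attention no matter what, so with a sink logit in the mix the "nothing matches" case at least has somewhere fixed to go.

    import torch
    import torch.nn.functional as F

    scores = torch.tensor([-4.0, -4.2, -3.9])             # nothing is relevant
    print(F.softmax(scores, dim=-1))                       # ~[0.34, 0.28, 0.38]: a third each anyway

    sink = torch.tensor([2.0])                             # illustrative sink logit
    print(F.softmax(torch.cat([sink, scores]), dim=-1))    # ~99% of the mass lands on the sink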
esafak•1h ago
Good point. Does that make them mitigate hallucinations?
yorwba•1h ago
In a sense? As the article notes, models trained using standard attention develop attention sinks naturally and removing them makes the model deteriorate completely, so the hallucinations you're thinking of were most likely output by a model that had already mitigated them in this way.
canjobear•1h ago
Seems like this was a better solution to the same problem https://www.evanmiller.org/attention-is-off-by-one.html
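For anyone who hasn't read it, the proposal is to add 1 to the softmax denominator (equivalently, an implicit extra logit fixed at zero), so a head is allowed to attend to nothing. A sketch:

    import torch

    def softmax_one(x, dim=-1):
        # softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j))
        m = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)  # stability; covers the implicit 0 logit
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))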
markisus•24m ago
Did this end up working? It sounds plausible but it needs some empirical validation.
Scene_Cast2•1h ago
I found a fairly large improvement in my toy transformer model where I added a "global" token akin to the CLS token in ViT.
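
Concretely it's just a learned embedding prepended to every sequence, something like this (toy sketch, not my exact model):

    import torch
    import torch.nn as nn

    class WithGlobalToken(nn.Module):
        def __init__(self, encoder, d_model):
            super().__init__()
            self.encoder = encoder                            # any (batch, seq, d_model) -> (batch, seq, d_model)
            self.global_tok = nn.Parameter(torch.zeros(1, 1, d_model))

        def forward(self, x):                                 # x: (batch, seq, d_model)
            g = self.global_tok.expand(x.shape[0], -1, -1)
            return self.encoder(torch.cat([g, x], dim=1))     # position 0 is the global slot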

Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).

innerlee•24m ago
The singular defects (or high-norm tokens) [1] may be related to attention sinks. It is interesting that all the high-norm tokens share the same direction. Maybe the theory behind it is not very complex, and the issue can be fixed cleverly during training.

[1] https://openreview.net/pdf?id=4yBnUokU2v
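That shared direction is easy to check on any model you have handy; a rough sketch (final-layer hidden states, k arbitrary):

    import torch
    import torch.nn.functional as F

    def high_norm_directions(hidden, k=5):
        # hidden: (seq, dim) final-layer hidden states for one sequence
        norms = hidden.norm(dim=-1)
        idx = norms.topk(k).indices                  # the k highest-norm tokens
        dirs = F.normalize(hidden[idx], dim=-1)
        return dirs @ dirs.T                         # pairwise cosine similarities (near 1 if aligned)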

Ultrathin business card runs a fluid simulation

https://github.com/Nicholas-L-Johnson/flip-card
384•wompapumpum•3h ago•87 comments

HorizonDB, a geocoding engine in Rust that replaces Elasticsearch

https://radar.com/blog/high-performance-geocoding-in-rust
71•j_kao•2h ago•16 comments

Getting Good Results from Claude Code

https://www.dzombak.com/blog/2025/08/getting-good-results-from-claude-code/
39•ingve•1h ago•31 comments

Astronomy Photographer of the Year 2025 shortlist

https://www.rmg.co.uk/whats-on/astronomy-photographer-year/galleries/2025-shortlist
10•speckx•58m ago•0 comments

The Rise of Ritual Features: Why Platforms Are Adding Daily Puzzle Games

https://productpickle.online/2025/07/20/ritual-features-the-quiet-strategy-behind-daily-puzzle-games-on-linkedin-and-beyond/
33•pkancharla•2h ago•27 comments

GPT-5

https://openai.com/gpt-5/
1930•rd•22h ago•2303 comments

Window Activation

https://blog.broulik.de/2025/08/on-window-activation/
99•LorenDB•4d ago•43 comments

Linear sent me down a local-first rabbit hole

https://bytemash.net/posts/i-went-down-the-linear-rabbit-hole/
309•jcusch•9h ago•130 comments

What Does Consulting Do?

https://www.nber.org/papers/w34072
18•surprisetalk•1h ago•9 comments

Food, housing, & health care costs are a source of major stress for many people

https://apnorc.org/projects/food-housing-and-health-care-costs-are-a-source-of-major-stress-for-many-people/
177•speckx•3h ago•232 comments

Telefon Hírmondó: Listen to news and music electronically, in 1893

https://en.wikipedia.org/wiki/Telefon_H%C3%ADrmond%C3%B3
24•csense•3d ago•3 comments

Show HN: Trayce – “Burp Suite for developers”

https://trayce.dev?resubmit=hn
30•ev_dev3•1d ago•6 comments

How Attention Sinks Keep Language Models Stable

https://hanlab.mit.edu/blog/streamingllm
69•pr337h4m•6h ago•15 comments

Show HN: Synchrotron, a real-time DSP engine in pure Python

https://synchrotron.thatother.dev/
15•andromedaM31•2h ago•0 comments

Flipper Zero dark web firmware bypasses rolling code security

https://www.rtl-sdr.com/flipperzero-darkweb-firmware-bypasses-rolling-code-security/
410•lq9AJ8yrfs•18h ago•241 comments

Historical Tech Tree

https://www.historicaltechtree.com/
457•louisfd94•20h ago•102 comments

Show HN: Aha Domain Search

https://www.ahadomainsearch.com/
6•slig•3d ago•5 comments

Exit Tax: Leave Germany before your business gets big

https://eidel.io/exit-tax-leave-germany-before-your-business-gets-big/
324•olieidel•21h ago•405 comments

Cursor CLI

https://cursor.com/cli
342•gonzalovargas•18h ago•233 comments

Complex Iterators Are Slow

https://caolan.uk/notes/2025-07-31_complex_iterators_are_slow.cm
20•todsacerdoti•4d ago•8 comments

FLUX.1-Krea and the Rise of Opinionated Models

https://www.dbreunig.com/2025/08/04/the-rise-of-opinionated-models.html
45•dbreunig•3d ago•18 comments

GPT-5: Key characteristics, pricing and system card

https://simonwillison.net/2025/Aug/7/gpt-5/
588•Philpax•21h ago•256 comments

OpenAI's new open-source model is basically Phi-5

https://www.seangoedecke.com/gpt-oss-is-phi-5/
357•emschwartz•20h ago•191 comments

What Is Popover=Hint?

https://una.im/popover-hint/
40•speckx•4d ago•10 comments

GPT-5 for Developers

https://openai.com/index/introducing-gpt-5-for-developers
439•6thbit•22h ago•250 comments

Virtual Linux Devices on ARM64

https://underjord.io/500-virtual-linux-devices-on-arm64.html
36•lawik•4d ago•3 comments

The BLS Can't Be Replaced by the Private Sector

https://www.bloomberg.com/opinion/articles/2025-08-08/the-bls-can-t-be-replaced-by-the-private-sector
81•petethomas•2h ago•75 comments

A love letter to my future employer (2020)

https://catzkorn.dev/blog/love-letter/
49•luu•9h ago•12 comments

Turn any website into an API

https://www.parse.bot
63•pcl•10h ago•19 comments

Encryption made for police and military radios may be easily cracked

https://www.wired.com/story/encryption-made-for-police-and-military-radios-may-be-easily-cracked-researchers-find/
214•mikece•20h ago•134 comments