This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
This sounds like it is working for the wrong reasons. Surely the right behavior is for the right neurons to receive attention rather than the first handful. Jamming everything into the first few positions is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior (or something similar) to prevent blurring; a rough sketch below.
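A minimal sketch of what such a regularizer might look like, assuming you have the post-softmax attention map available as a tensor. The term names, weights, and sink width here are hypothetical illustrations, not from any paper:

```python
import torch

def attention_reg(attn_probs: torch.Tensor,
                  n_sink: int = 4,            # hypothetical: width of the "sink" prefix
                  equalize_weight: float = 0.1,
                  sparsity_weight: float = 0.01) -> torch.Tensor:
    """Regularizer on attention maps of shape (batch, heads, q_len, k_len).

    Two terms:
      * equalization: penalize mass piled onto the first few key
        positions (the attention sink),
      * sparsity: penalize high-entropy (blurred) attention rows, so
        that equalizing doesn't just smear mass everywhere instead.
    """
    # Mass each query places on the first n_sink keys.
    sink_mass = attn_probs[..., :n_sink].sum(dim=-1).mean()

    # Shannon entropy per attention row; high entropy == blurred.
    entropy = -(attn_probs * (attn_probs + 1e-9).log()).sum(dim=-1).mean()

    return equalize_weight * sink_mass + sparsity_weight * entropy
```

You would add this to the training loss; the two weights trade off how hard you push mass away from the sink against how peaked the resulting rows have to be.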
Another approach I've seen is the "Diff Transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
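For context, the core idea there is to compute two softmax attention maps from separate query/key projections and subtract one from the other, cancelling common-mode attention noise much like a differential amplifier. A stripped-down sketch with a fixed lambda (the actual implementation learns lambda and adds per-head normalization):

```python
import torch

def diff_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    """Simplified differential attention: two attention maps over the
    same values, with the second subtracted to cancel shared noise."""
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v
```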
Havoc•4h ago
I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.