Reverse-engineering retrieval in decoder-only Transformers

https://github.com/tmaselko/paper-attncap

2•tmaselko•1h ago

Comments

tmaselko•1h ago

I wanted to pick attention's head dimensions based on something other than vibes, so I reverse-engineered retrieval. I adapted MQAR (Multi-Query Associative Recall) into TSAR, "Tuple-Structured Associative Recall". Sequence positions become tuples that carry their full semantic meaning, which lets me remove positional confounds entirely. What I found was news to me, so I tidied it up, wrote it down, and put it in a repo.

In summary: without positional confounds, Transformers are a powerhouse at retrieval. Length generalization is effortless. As long as head dimension is 2 or more, it does not limit retrieval capacity at all. Retrieval is geometry-driven and involves three mechanisms: separation (of hidden state geometry into a dense spherical code), projection (of the code from the hidden state), and amplification (to sharpen/saturate softmax).
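As a rough illustration of why a head dimension of 2 can be enough, here is a toy sketch (not code from the repo). It assumes the keys already sit in a dense code on the unit circle, the query is an exact copy of the target key, and amplification is a single scalar `gain`. The function names, the 64-key count, and the gain values are all hypothetical choices for the example.

```python
import math

def circle_code(n):
    """Place n keys evenly on the unit circle: a dense spherical code in 2-D."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, gain):
    """Attention weights for one query; `gain` stands in for amplification."""
    scores = [gain * (query[0] * k[0] + query[1] * k[1]) for k in keys]
    return softmax(scores)

keys = circle_code(64)   # separation: 64 distinct directions in a 2-D head
target = 17
query = keys[target]     # projection: the query lands on the target's direction

for gain in (1.0, 100.0, 10000.0):
    weights = attend(query, keys, gain)
    best = max(range(len(keys)), key=weights.__getitem__)
    print(f"gain={gain:>8}: argmax={best}, weight on target={weights[target]:.3f}")
```

In this toy the argmax is the target at every gain, because the target has the largest dot product. What the gain changes is how much of the softmax mass lands on it: at low gain the weights are close to uniform, and at high gain they saturate on the target.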

Some other fun implications:

- Models can represent features in dense spherical codes, not just orthogonal axes or superpositions.

- Retrieval heads appear to cripple their own gradients upon formation.

- Mainstream positional encodings aren't designed with retrieval in mind, and are antagonistic to it. Follow-up experiments hint that simply including a PE is catastrophic for retrieval.

- Length generalization failures should be mostly PEs warping the learned code so separations become alignments and alignments become separations.

- "Out-of-distribution" can be seen as "never accounted for in the spherical code". If it hasn't been seen it cannot be separated, and if it hasn't been separated it cannot be distinguished.
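The point about PEs warping the code can be pictured with a toy sketch, under strong assumptions: a single 2-D head, a fixed circle code that has not adapted to the PE, and a rotary-style PE that rotates each vector by `position * theta`. None of this comes from the repo; `theta`, `q_pos`, and the helper names are hypothetical, and a trained model may well behave differently.

```python
import math

def circle_code(n):
    """n unit vectors evenly spaced on the circle (the content code)."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def retrieve(query, keys):
    """Index of the key with the largest dot product against the query."""
    scores = [query[0] * k[0] + query[1] * k[1] for k in keys]
    return max(range(len(keys)), key=scores.__getitem__)

n = 16
code = circle_code(n)
target = 5
theta = 0.3   # hypothetical per-position rotation, in radians
q_pos = n     # the query sits just past the last key

# Without a PE, the query aligns with its own key.
clean = retrieve(code[target], code)

# With the rotary-style PE, key i is rotated by i * theta and the query by
# q_pos * theta, so the alignments of the content code shift with position.
warped_keys = [rotate(k, i * theta) for i, k in enumerate(code)]
warped_query = rotate(code[target], q_pos * theta)
warped = retrieve(warped_query, warped_keys)

print(f"no PE: retrieved {clean}; rotary-style PE: retrieved {warped}")
```

In this setup the unrotated query retrieves its target, while the rotated one lines up with a different key: a former alignment has become a separation and a former separation an alignment.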

Preprint here: https://zenodo.org/records/19359748 (Still fishing for an arXiv endorsement...)

GitHub repo here: https://github.com/tmaselko/paper-attncap

You can replicate the headline results in five minutes on a 4090, or the whole paper in 20-30 hours if so inclined.

I'd be happy to answer any questions. I'm kinda starved for feedback on this.
