If you're interested in machine learning at all and not very strong on kernel methods, I highly recommend taking a deep dive. Such a huge amount of ML can be framed through the lens of kernel methods (and things like Gaussian Processes will become much easier to understand).
0. https://web.archive.org/web/20250820184917/http://bactra.org...
I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?
They derive Q, K, V as a continuous analog of a Hopfield network.
The Free Transformer: https://arxiv.org/abs/2510.17558
Abstract: We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
https://web.archive.org/web/20230713101725/http://bactra.org...
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
Surprisingly, reading this piece helped me better understand the query/key metaphor.
I find attention much easier to understand in the original attention paper [0], which focuses on cross-attention for machine translation. In translation, the input sentence to be translated is tokenized into vectors {x_1...x_n}. The translated sentence is autoregressively generated into tokens {y_1...y_m}. To generate y_j, the model computes a similarity score of the previously generated token y_{j-1} against every x_i via the dot product s_{i,j} = x_i*K*y_{j-1}, transformed by the Key matrix. These are then softmaxed to create a weight vector a_j = softmax_i(s_{i,j}). The weighted average of X = [x_1|...|x_n] is taken with respect to a_j and transformed by the Value matrix, i.e. c_j = V*X*a_j. c_j is then passed to additional network layers to generate the output token y_j.
tl;dr: given the previous output token, compute its similarity to each input token (via K). Use those similarity scores to compute a weighted average across all input tokens, and use that weighted average to generate the next output token (via V).
Note that in this paper, the Query matrix is not explicitly used. It can be thought of as a token preprocessor: rather than computing s_{i,j} = x_i*K*y_{j-1}, each x_i is first linearly transformed by some matrix Q. Because this paper used an RNN (specifically, an LSTM) to encode the tokens, such transformations on the input tokens are implicit in each LSTM module.
[0] https://arxiv.org/pdf/1508.04025 (predates "Attention is all you need" by 3 years)
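To make this concrete, here is roughly what that single decoding step looks like in code (a toy sketch with made-up sizes, following the notation above rather than the paper's exact formulation):

    # Toy sketch of one cross-attention decoding step (made-up sizes).
    import torch

    n, d = 10, 64                    # 10 input tokens, hidden size 64 (assumed)
    X = torch.randn(n, d)            # encoded input tokens x_1..x_n, one per row
    y_prev = torch.randn(d)          # previously generated token's state y_{j-1}
    K = torch.randn(d, d)            # learned Key matrix
    V = torch.randn(d, d)            # learned Value matrix

    scores = X @ K @ y_prev          # s_{i,j} = x_i * K * y_{j-1}, one score per input token
    a = torch.softmax(scores, dim=0) # attention weights a_j over the input tokens
    c = V @ (X.T @ a)                # c_j = V * X * a_j: weighted average, then projected

    # c is passed to further layers to generate the next output token y_j.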
Also forget the terms "query", "key" and "value", or vague analogies to key-value stores, that is IMO a largely false analogy, and certainly not a helpful way to understand what is happening.
Am I the only one who thinks it's not obvious that the "it" refers to the mat? The cat could be sitting on the mat because the cat is comfortable.
Many sentences require some knowledge of the world to process. In this case, you need to know that "being comfortable dictates where you sit" doesn't happen nearly as often as "where you sit dictates your comfort."
Even for humans, NLP is probabilistic, which is why we still often get it wrong. Or at least I know that I do.
The sentence is itself a bit awkward and strange on its own, though, and really needs context. In fact, this is because the sentence was generated as a short example to make a point about attention and tokens, and is not really something someone would utter naturally in isolation.
I mostly just wanted to playfully comment that original GP / top-level comment had a valid point about the ambiguity!
Imagine the input text as though it were the whole internet and each page is just 1 token. Your job is to build a neural-network Google results page for that mini internet of tokens.
In traditional search, we are given a search query, and we want to find web pages via an intermediate search results page with 10 blue links. Basically, when we're Googling something, we want to know "What web pages are relevant to this given search query?", and then given those links we ask "what do those web pages actually say?" and click on the links to answer our question. In this case, the "Query" is obviously the user search query, the "Key" is one of the ten blue links (usually the title of the page), and the "Value" is the content of the web page that link goes to.
In the attention mechanism, we are given a token and we want to find its meaning when contextualized with other tokens. Basically, we are first trying to answer the question "which other tokens are relevant to this token?", and then given the answer to that we ask "what is the meaning of the original token given these other relevant tokens?" The "Query" is a given token in the input text, the "Key" is another token in the input text, and the "Value" is the final meaning of the original token with that other token in context (in the form of an embedding). For a given token, you can imagine it is as though the attention mechanism "clicked the 10 blue links" of the other most relevant tokens in the input and combined them in some way to figure out the meaning of the original query token (and also you might imagine we ran such a query in parallel for every token in the input text at the same time).
So the self-attention mechanism is basically Google search, except that instead of a user query it's a token in the input, instead of a blue link it's another token, and instead of a web page it's a meaning.
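For anyone who wants to see the analogy in code, here is a minimal single-head self-attention sketch (toy sizes, illustrative only): each token issues a "query", gets scored against every token's "key", and the resulting weights mix the "values".

    # Minimal single-head self-attention (illustrative, made-up sizes).
    import torch

    n, d = 8, 32                              # 8 tokens, embedding size 32 (assumed)
    x = torch.randn(n, d)                     # token embeddings

    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # learned projections
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # each token gets a query, a key and a value

    scores = Q @ K.T / d ** 0.5               # "how relevant is token j to token i?"
    weights = torch.softmax(scores, dim=-1)   # per token: a distribution over all tokens
    out = weights @ V                         # mix the values of the tokens it attended to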
Heck, attention layers never even see tokens. Even the first self-attention layer sees positional embeddings, but all subsequent attention layers are just seeing complicated embeddings that are a mish-mash of the previous layers' embeddings.
A little side project I've been working on is to train a model that sits on top of the LLM, looks at each key, determines whether it will still be needed after a certain lifespan, and evicts it once that lifespan expires. Still working on it, but my first-pass test shows a 90% reduction in keys!
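Roughly the shape of the idea (a simplified sketch with a hypothetical scorer network and made-up sizes, not the real implementation):

    # Simplified sketch of lifespan-based KV-cache eviction (hypothetical scorer).
    import torch

    def evict_expired_keys(keys, values, ages, scorer, min_age=128, threshold=0.5):
        # Keep a key/value pair if it is still young, or if the scorer
        # predicts it will be needed again.
        keep_score = torch.sigmoid(scorer(keys)).squeeze(-1)
        keep = (ages < min_age) | (keep_score > threshold)
        return keys[keep], values[keep], ages[keep]

    scorer = torch.nn.Linear(64, 1)                      # stand-in for the learned model
    keys, values = torch.randn(1000, 64), torch.randn(1000, 64)
    ages = torch.arange(1000)
    keys, values, ages = evict_expired_keys(keys, values, ages, scorer)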
libraryofbabel•11h ago
For anyone serious about coming to grips with this stuff, I would strongly recommend Sebastian Raschka's excellent book Build a Large Language Model (From Scratch), which I just finished reading. It's approachable and also detailed.
As an aside, does anyone else find the whole "database lookup" motivation for QKV kind of confusing? (in the article, "Query (Q): What am I looking for? Key (K): What do I contain? Value (V): What information do I actually hold?"). I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix. After all, the actual contents and "meaning" of QKV are very opaque: the weights that are used to construct them are learned during training. Furthermore, there is a lot of symmetry between Q and K in the algebra, which gets broken only by the causal mask. Or do people find this motivation useful and meaningful in some deeper way? What am I missing?
[edit: on this last question, the article on "Attention is just Kernel Smoothing" that roadside_picnic posted below looks really interesting in terms of giving a clean generalized mathematical approach to this, and also affirms that I'm not completely off the mark by being a bit suspicious about the whole hand-wavy "database lookup" Queries/Keys/Values interpretation]
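To make the "quadratic in x" point concrete, here's a toy sketch (mine, not from the book): the pre-softmax scores QK^T are just a low-rank bilinear form in the input embeddings, and Q vs K only become distinguishable once the softmax picks a normalization direction and the causal mask is applied.

    # The pre-softmax attention scores are a low-rank bilinear form in x (toy sizes).
    import torch

    n, d, d_head = 8, 32, 16
    x = torch.randn(n, d)
    Wq, Wk = torch.randn(d, d_head), torch.randn(d, d_head)

    scores_qk = (x @ Wq) @ (x @ Wk).T   # the usual Q K^T
    W = Wq @ Wk.T                       # a single rank-16 bilinear form on the embeddings
    scores_bilinear = x @ W @ x.T       # identical result: x_i^T W x_j

    print(torch.allclose(scores_qk, scores_bilinear, atol=1e-4))  # True (up to float error)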
ebbi•11h ago
libraryofbabel•11h ago
The best part about it is seeing the code built up for the GPT-2 architecture in basic pytorch, and then loading in the real GPT-2 weights and they actually work! So it's great for learning but also quite realistic. It's LLM architecture from a few years ago (to keep it approachable), but Sebastian has some great more advanced material on modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
ebbi•11h ago
libraryofbabel•11h ago
kouteiheika•6h ago
This might be underselling it a little bit. The difference between GPT-2 and Qwen3 is maybe, I don't know, ~20 lines of code if you write it well? The biggest difference is probably RoPE (which can be tricky to wrap your head around); the rest is pretty minor.
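For reference, RoPE boils down to roughly this (a simplified sketch, not Qwen3's exact implementation): each pair of dimensions in a query/key vector gets rotated by an angle proportional to the token's position, so that the Q·K dot products end up depending on relative position.

    # Simplified RoPE sketch: rotate pairs of dimensions by position-dependent angles.
    import torch

    def rope(x, base=10000.0):
        seq_len, dim = x.shape          # dim assumed even
        half = dim // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, :half], x[:, half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    q = torch.randn(8, 64)
    q_rotated = rope(q)                 # applied to queries and keys before Q K^T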
libraryofbabel•6h ago
mnicky•11h ago
libraryofbabel•11h ago
mnicky•4h ago
If the asymmetry of K and Q stems from the direction of the softmax application, it must also be the reason for the names of the matrices :)
And if you think about it, it makes sense that for each Query the weights over all of the Keys sum to 1, and not vice versa.
So this is my only intuition for the K and Q names.
(It may or may not be similar to the whole "db lookup thing"... I just don't use that one.)
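Concretely, as a toy example: with the softmax over the key axis, each Query's row of weights sums to 1, while the columns (per Key) generally don't.

    # Softmax direction: each query gets a distribution over keys, not vice versa.
    import torch

    Q = torch.randn(5, 16)   # 5 queries
    K = torch.randn(7, 16)   # 7 keys

    A = torch.softmax(Q @ K.T / 16 ** 0.5, dim=-1)   # shape (5, 7)
    print(A.sum(dim=-1))   # all ones: one distribution over keys per query
    print(A.sum(dim=0))    # generally not ones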
ebonnafoux•4h ago
yorwba•49m ago
p1esk•11h ago
D-Machine•9h ago
[1] https://www.emergentmind.com/topics/merged-attention
[2] https://blog.google/innovation-and-ai/technology/developers-...
[3] https://arxiv.org/abs/2111.07624
p1esk•8h ago
Do you see any issues with my interpretation of them?
D-Machine•8h ago
Your terms "sensitivity", "visibility", and "important" are too vague and lack any clear mathematical meaning, so IMO add nothing to any understanding. "Important" also seems factually wrong, given these layers are stacked, so later weights and operations can in fact inflate / reverse things. Deriving e.g. feature importances from self-attention layers remains a highly disputed area (e.g. [1] vs [2], for just the tip of the iceberg).
You are also assuming that what makes attention important is the highly specific QKV structure and projection, but there is very little reason to believe that based on the third review link I shared. Or, if you'd like another example of why not to focus so much on scaled dot-product attention, see that it is just a subset of a broader category of multiplicative interactions (https://openreview.net/pdf?id=rylnK6VtDH).
[1] Attention is not Explanation - https://arxiv.org/abs/1902.10186
[2] Attention is not not Explanation - https://arxiv.org/abs/1908.04626
p1esk•6h ago
2. I don't see how the transformations done in one attention block can be reversed in the next block (or in the FFN network immediately after the first block): can you please explain?
3. All state of the art open source LLMs (DeepSeek, Qwen, Kimi, etc) still use all three QKV projections, and largely the same original attention algorithm with some efficiency tweaks (grouped query, MLA, etc) which are done strictly to make the models faster/lighter, not smarter.
4. When GPT2 came out, I myself tried to remove various ops from attention blocks, and evaluated the impact. Among other things I tried removing individual projections (using unmodified input vectors instead), and in all three cases I observed quality degradation (when training from scratch).
5. The terms "sensitivity", "visibility", and "important" all attempt to describe feature importance when performing pattern matching. I use these terms in the same sense as importance of features matched by convolutional layer kernels, which scan the input image and match patterns.
D-Machine•5h ago
2. I didn't say the transformations can be reversed, I said if you interpret anything as an importance (e.g. a magnitude), that can be inflated / reversed by whatever weights are learned by later layers. Negative values and/or weights make this even more annoying / complicated.
3. Not sure how this is relevant, but, yes, any reasons for caring about QKV and scaled dot-product attention specifics are mostly related to performance and/or current popular leading models. But there is nothing fundamentally important about scaled dot-product attention; it most likely just happens to be something that was settled on prematurely because it works quite well and is easy to parallelize. Or, if you like the kernel smoothing explanation also mentioned in this thread, scaled dot-product self-attention implements something very similar to a particularly simple and nice form of kernel smoothing.
4. Yup, removing ops from scaled dot-product attention blocks is going to dramatically reduce expressivity, because there really aren't many ops there to remove. But there is enough work on low-rank attention, linear attentions, and sparse attentions that shows you can remove a lot of expressivity and still do quite well. And, of course, the many other helpful types of attention I linked before give gains in some cases too. You should be skeptical about any really simple or clear story about what is going on here. In particular, there is no clear reason why a small hypernetwork couldn't be used to approximate something more general than scaled dot-product attention, except that this is obviously going to be more expensive, and in practice you can probably just get the same approximate flexibility by stacking simpler attention layers.
5. I still find that doesn't give me any clear mathematical meaning.
I suspect our learning goals are at odds. If you want to focus solely on the very specific kind of attention used in the popular transformer models today, perhaps because you are interested in optimizations or distillation or something, then by all means try to come up with special intuitions about Q, K, and V, if you think that will help here. But those intuitions will likely not translate well to future and existing modifications and improvements to attention layers, in transformers or otherwise. You will be better served learning about attention broadly and developing intuitions based on that.
Others have mentioned the kernel smoothing interpretation, and I think multiplicative interactions are the clearer deeper generalization of what is really important and valuable here. Also, the useful intuitions in DL have been less about e.g. "feature importances" and "sensitivity" and such, but tend to come more from linear algebra and calculus, and tend to involve things like matrix conditioning and regularization / smoothing and Lipschitz constants and the like. In particular, the softmax in self-attention is probably not doing what people typically say it does (https://arxiv.org/html/2410.18613v1), and the real point is that all these attention layers are trained in an end-to-end fashion where all layers are interdependent on each other to varying complicated degrees. Focusing on very specific interpretations ("Q is this, K is that"), especially where these interpretations are sort of vaguely metaphorical, like yours, is not likely to result in much deep understanding, in my opinion.
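To make the kernel-smoothing point concrete, a toy illustration: scaled dot-product attention is exactly Nadaraya-Watson smoothing of the values, with an exponential kernel over query/key similarity.

    # Attention as kernel smoothing (toy illustration).
    import torch

    n, d = 6, 32
    Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

    attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V   # standard attention

    kern = torch.exp(Q @ K.T / d ** 0.5)                   # exponential kernel weights
    smoothed = (kern @ V) / kern.sum(dim=-1, keepdim=True) # Nadaraya-Watson average

    print(torch.allclose(attn, smoothed, atol=1e-5))       # True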
psb217•1h ago
Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values with weights that are optimized at test time to minimize value retrieval error. Like you suggest, designing these hypernetworks to permit efficient implementations is tricky.
It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.
[1] https://arxiv.org/abs/2501.00663
[2] https://arxiv.org/abs/2504.13173
[3] https://arxiv.org/abs/2505.23735
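A toy version of that test-time-optimization idea (my sketch, not the actual mechanism from these papers): keep a linear associative memory W and take gradient steps at inference time to reduce the value-retrieval error for each new key/value pair.

    # Toy test-time optimization of a linear associative memory (not the papers' method).
    import torch

    d = 32
    W = torch.zeros(d, d)          # the "memory": retrieves v_hat = W @ k
    lr = 0.01

    def write(W, k, v, lr):
        # one gradient step on the retrieval error ||W k - v||^2 (constant folded into lr)
        err = W @ k - v
        return W - lr * torch.outer(err, k)

    k, v = torch.randn(d), torch.randn(d)
    for _ in range(50):            # repeated writes drive the retrieval error down
        W = write(W, k, v, lr)
    print(torch.norm(W @ k - v))   # small residual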
andoando•10h ago
For one, I have no idea how this relates to the mathematical operations of calculating the attention scores, applying softmax, and then taking the dot product with the V matrix.
Second, just conceptually I don't understand how this relates to "a word looks up how relevant it is to another word". So if you have "The cat eats his soup", "his" queries how important it is to "cat". So is V just the numerical result of the significance, like 0.99?
I don't think I'm very stupid, but after seeing dozens of these I am starting to wonder if anyone actually understands this conceptually.
empiricus•2h ago
D-Machine•9h ago
That's because what you say here is the correct understanding. The lookup thing is nonsense.
The terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases (e.g. cross-attention), where x and y are are outputs from previous layers.
Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations/similarities and/or multiplicative interactions among a dimension-reduced representation. EDIT: Or, as you say, it can be regarded as kernel smoothing.
libraryofbabel•9h ago
I’ll have to read up on merged attention, I haven’t got that far yet!
D-Machine•9h ago
A paper I found particularly useful on this generalizes even further, noting the importance of multiplicative interactions more generally in deep learning (https://openreview.net/pdf?id=rylnK6VtDH).
EDIT: Also, this paper I was looking for dramatically generalizes the notion of attention in a way I found to be quite helpful: https://arxiv.org/pdf/2111.07624
ianand•5h ago
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in (latent) space. The attention mechanism is like a kind of "gravity" where each token is influencing each other, pushing and pulling each other around in (latent) space to refine their meaning. But instead of "distance" and "mass", this gravity is proportional to semantic inter-relatedness and instead of physical space this is occurring in a latent space.
https://www.youtube.com/watch?v=ZuiJjkbX0Og&t=3569s