If I understand this correctly, there are three major problems with LLMs right now.
1. LLMs reduce a very high-dimensional vector space into a very low-dimensional vector space. Since we don't know what the dimensions in the low-dimensional vector space mean, we can only check that the outputs are correct most of the time.
What research is happening to resolve this?
2. LLMs use written texts to facilitate this reduction. So they don't learn from reality, but from what humans have written down about reality.
It seems like Keen Technologies tries to avoid this issue by using (simple) robots with sensors for training instead of human text. That seems like a much slower process, but it could yield more accurate models in the long run.
3. LLMs hold internal state as a vector that reflects the meaning and context of the "conversation". That explains why the quality of responses deteriorates in longer conversations: if one vector is "stamped over" again and again, the meaning of the first "stamps" gets blurred.
Are there alternative ways of holding state, or is the only way around this to back up that state vector at every point and revert if things go awry?
(1) While studying the properties of the mathematical objects produced is important, I don't think we should understand the situation you describe as a problem to be solved. In old supervised machine learning methods, human beings were tasked with defining the rather crude 'features' of relevance in a data/object domain, so each dimension had some intuitive significance (often binary 'is tall', 'is blue' etc). The question now is really about learning the objective geometry of meaning, so the dimensions of the resultant vector don't exactly have to be 'meaningful' in the same way -- and, counter-intuitive as it may seem, this is progress. Now the question is of the necessary dimensionality of the mathematical space in which semantic relations can be preserved -- and meaning /is/ in some fundamental sense the resultant geometry.
(2) This is where the 'Platonic hypothesis' research [1] is so fascinating: empirically we have found that the learned structures from text and image converge. This isn't saying we don't need images and sensor robots, but it appears we get the best results when training across modalities (language and image, for example). This is really fascinating for how we understand language. While any particular text might get things wrong, the language that human beings have developed over however many thousands of years really does seem to do a good job of breaking out the relevant possible 'features' of experience. The convergence of models trained from language and image suggests a certain convergence between what is learnable from sensory experience of the world and the relations that human beings have slowly come to know through the relations between words.
[1] https://phillipi.github.io/prh/ and https://arxiv.org/pdf/2405.07987
I've never really challenged that text is a suitable stand-in for important bits of reality. I worry instead about meta-limitations of text: can we reliably scale our training corpus without accreting incestuous slop from other models?
Sensory bots would seem to provide a convenient way out of this problem, but I'm not well-read enough to know.
What do you mean? There is an embedding size that is maintained constant from the first layer to the last. Embedding lookup, N x transformer layers, softmax - all three of them have the same dimension.
Maybe you mean LoRA is "reducing a high-dimensional vector space into a lower-dimensional vector space"
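A minimal PyTorch sketch of that shape invariance (the layer type and sizes here are illustrative, not taken from the article):

    import torch
    import torch.nn as nn

    d_model, vocab_size, n_layers = 512, 50_000, 6

    embed = nn.Embedding(vocab_size, d_model)          # token id -> d_model vector
    blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        for _ in range(n_layers)
    ])
    unembed = nn.Linear(d_model, vocab_size)           # d_model vector -> vocab logits

    x = embed(torch.randint(0, vocab_size, (1, 16)))   # shape (1, 16, 512)
    for block in blocks:
        x = block(x)                                   # shape stays (1, 16, 512) at every layer
    logits = unembed(x)                                # shape (1, 16, 50000), then softmax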
> This blog post is recommended for desktop users.
That said, there is a lot of content here that could have been mobile-friendly with very little effort. The first image, of embeddings, is a prime example. It has been a very long time since I've seen any online content, let alone a blog post, that requires a desktop browser. Simple incompetence? Not so much: whoever wrote this wanted to kick sand in the user's face.
>If your ears are more important than your eyes, you can listen to the podcast version of this article generated by NotebookLM.
It looks like an LLM would read it to you; I wonder if one could have made it mobile-friendly.
Lots of console errors with the likes of "Content-Security-Policy: The page’s settings blocked an inline style (style-src-elem) from being applied because it violates the following directive: “style-src 'self'”." etc...
https://app.vidyaarthi.ai/ai-tutor?session_id=C2Wr46JFIqslX7...
Our goal is to make abstract concepts more intuitive and interactive — kind of like a "learning-by-doing" approach. Would love feedback from folks here.
(Not trying to self-promote — just sharing a related learning tool we’ve put a lot of thought into.)
Tokens are a form of compression, and working on an uncompressed representation would require more memory and more processing power.
First, the embedding typically uses thousands of dimensions.
Then, the value along each dimension is represented with a floating-point number, which takes 16 bits (though it can be smaller with more aggressive quantization).
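As a rough back-of-the-envelope (the 4096-dimensional width below is an assumption, not a figure from the thread):

    d_model = 4096              # assumed embedding width; varies by model
    bytes_per_value = 2         # fp16
    internal = d_model * bytes_per_value
    raw_text = 4                # a typical token is only a few bytes of text
    print(internal, raw_text)   # 8192 bytes internally vs ~4 bytes of raw text per token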
My point was that you compared how the LLM represents a token internally versus how “English” transmits a word. That’s a category error.
To be pedantic, we can't feed humans ASCII directly, we have to convert it to images or sounds first.
> My original question was about that: why can't we just feed the LLMs ascii, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ascii, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.
That could be done by having only 256 tokens, one for each possible byte, plus perhaps a few special-use tokens like "end of sequence". But it would be much less efficient.
In either case, tokens are a necessary part of LLMs. They need a differentiable representation in order to be trained effectively. High-dimensional embeddings are differentiable and can usefully represent the "meaning" of a token.
In other words, the representation of "cat" in an LLM must be something that can be gradually nudged towards "kitten", or "print", or "excavator", or other possible meanings. This is doable with the large vector representation, but such operation makes no sense when you try to represent the meaning directly in ASCII.
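A sketch of what that byte-level scheme could look like (the names and the 512-dim width are made up for illustration). Every byte still gets a learned embedding vector that gradients can nudge; the sequences are just much longer than with a BPE vocabulary:

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 256 + 1      # one token id per byte value, plus an end-of-sequence id
    EOS = 256

    def bytes_to_ids(text: str) -> torch.Tensor:
        return torch.tensor(list(text.encode("utf-8")) + [EOS])

    embed = nn.Embedding(VOCAB_SIZE, 512)   # each byte maps to a trainable 512-dim vector

    ids = bytes_to_ids("cat")               # tensor([ 99,  97, 116, 256])
    vectors = embed(ids)                    # shape (4, 512); this is what training nudges
    print(ids.tolist(), vectors.shape)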
(However, there seems to be some serious back-button / browser-history hijacking on this page. Just scrolling down the page appends a ton to my browser history, which is lame.)
So someone, at some point, thought this was a feature
This is an optimization that many vector DBs use in retrieval, since it is typically much faster to compute Euclidean distance than cosine similarity.
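The shortcut only works when the vectors are unit-normalized (which most embedding models do, but that's an assumption here); a quick numpy check:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=128), rng.normal(size=128)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # unit-normalize both vectors

    cosine = a @ b
    euclidean_sq = np.sum((a - b) ** 2)

    # For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so ranking by either gives the same order.
    print(np.isclose(euclidean_sq, 2 - 2 * cosine))       # True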
In d dimensions you can have d vectors that are mutually orthogonal.
Interestingly this means that for sequence lengths up to d, you can have precise positional targeting attention. As soon as you go to longer sequences that's no longer universally possible.
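A toy numpy illustration, with the identity matrix standing in for d mutually orthogonal key vectors:

    import numpy as np

    d = 8
    keys = np.eye(d)                       # d mutually orthogonal keys, one per position

    query = keys[3]                        # query aligned with position 3's key
    scores = keys @ query                  # 1 at position 3, 0 everywhere else
    attn = np.exp(scores) / np.exp(scores).sum()
    print(attn.argmax())                   # 3: the query singles out exactly that position

    # With more than d positions the keys can no longer all be mutually orthogonal,
    # so no query can isolate a single position this cleanly.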
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
One problem with this technique is that the model wasn't trained with intermediate layers being mapped to logits in the first place, so it's not clear why the LM head should be able to map them to anything sensible. But alas, like everything in DL research, they threw science at the wall and a bit of it stuck.
I've come up with so many failed analogies at this point that I've lost count (the concept of fast and slow clocks representing the positional index / angular rotation is the closest I've come so far).
Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.
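To make the "fast and slow clocks" analogy above concrete, here is a toy sketch of the rotation RoPE applies (dimensions and positions are arbitrary):

    import numpy as np

    def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
        # Rotate consecutive feature pairs by position-dependent angles.
        d = x.shape[-1]
        out = x.astype(float).copy()
        for i in range(0, d, 2):
            theta = pos / base ** (i / d)   # low i: fast clock; high i: slow clock
            c, s = np.cos(theta), np.sin(theta)
            out[i], out[i + 1] = x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c
        return out

    q = np.ones(8)
    # The two dot products are equal: they depend only on the relative offset (2),
    # not on the absolute positions, which is the property that makes RoPE attractive.
    print(np.dot(rope(q, 5), rope(q, 7)), np.dot(rope(q, 105), rope(q, 107)))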
EDIT: there is a recent paper that addresses the sequence-modeling problem in another way, but it's somewhat orthogonal to the above, as they change the tokenization method entirely: https://arxiv.org/abs/2507.07955
My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later on in the sequence "absorb" information from the hidden states of previous tokens, so in this sense you can imagine information flowing "down (depth) and to the right (token position)", and you could imagine the network learning a scheme to somehow use this property to encode position.
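A toy illustration of that intuition (this is just the standard causal mask, not the NoPE paper's construction):

    import numpy as np

    seq_len = 5
    causal_mask = np.tril(np.ones((seq_len, seq_len)))   # row t may attend to positions 0..t only
    print(causal_mask.sum(axis=1))                       # [1. 2. 3. 4. 5.]

    # Each row sees a different number of tokens, so even with uniform attention the
    # averaged hidden state differs by position: one crude channel through which
    # position can leak into the network without any explicit positional encoding.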
NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax).
There was a recent paper back in December[1] about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD whether that proves out long term).
This all goes back to my original comment, though, about how challenging it is to communicate this idea to AI/ML neophytes. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation.
That's a really interesting three-word noun-phrase. Is it a term-of-art, or a personal analogy?
LLM embeddings are so abstract and so far removed from any human-interpretable or statistical counterpart that even as the embeddings contain more information, that information becomes less accessible to humans.
[1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971...
They should be a really big deal! Though I can see why trying to comprehend a 1,000-dimensional vector space might be intimidating.
Also, the results were not great. Are there any good embedding API providers?
- JinaAI (https://jina.ai/embeddings/): v3 and v4 performed well in my testing.
- Google's Gemini-001 model (https://ai.google.dev/gemini-api/docs/models#gemini-embeddin...).
Overall, both were surpassed by Qwen3-8b (https://huggingface.co/Qwen/Qwen3-Embedding-8B).
Note, this was specifically regarding English and Code embedding generation/retrieval, with reranking.
Something like https://projector.tensorflow.org/
just type a word in, select UMAP projection.
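If you'd rather do something similar offline, here is a rough sketch using the umap-learn package (the random matrix below is only a stand-in for real word embeddings):

    import numpy as np
    import umap

    embeddings = np.random.default_rng(0).normal(size=(500, 300))  # 500 "words", 300 dims each
    coords_2d = umap.UMAP(n_components=2).fit_transform(embeddings)
    print(coords_2d.shape)   # (500, 2): points you can scatter-plot and explore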
So out of interest: During inference, the embedding is simply a lookup table "token ID -> embedding vector". Mathematically, you could represent this as encoding the token ID as a (very very long) one-hot vector, then passing that through a linear layer to get the embedding vector. The linear layer would contain exactly the information from the lookup table.
My question: Is this also how the embeddings are trained? I.e. just treat them as a linear layer and include them in the normal backpropagation of the model?
(If you're curious about the details, there's an example of making indexing differentiable in my minimal deep learning library here: https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...)
If N is the vocab size and L is the sequence length, you'd need to create an N x L matrix and multiply it with the embedding matrix. But since the N x L matrix is sparse, with only a single 1 per column, it makes sense to represent it internally as just one number per column: the index at which the 1 sits. At that point, if you define multiplication by this matrix specially, it basically becomes indexing with that number.
And just like you write a special forward pass, you can write a special backward pass so that backpropagation would reach it.
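A quick PyTorch sketch of that equivalence (tiny sizes, purely for illustration):

    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 8, 4
    emb = torch.nn.Embedding(vocab_size, d_model)
    token_ids = torch.tensor([3, 5])

    lookup = emb(token_ids)                          # what frameworks actually do
    one_hot = F.one_hot(token_ids, vocab_size).float()
    matmul = one_hot @ emb.weight                    # the "linear layer over one-hots" view

    print(torch.allclose(lookup, matmul))            # True: identical vectors
    # In training, the gradient w.r.t. emb.weight is nonzero only for the rows that
    # were actually indexed, which is exactly what the sparse backward pass computes.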
As embeddings pass through the various layers, you can see what contribution each transformer layer makes to classification. Holes of different dimensions form (1-d, 2-d, 3-d...), and each tells you something about the shape of the data (the embeddings) as it traverses the network. It can help in reducing layers / reducing backprop; some layers are more important than others.
You will see none of this using Vietoris-Rips!
In case you want to play with and visually understand the traditional PE:
Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs look from left to right:
Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.
Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
Here is another link from Hugging Face, about ModernBERT, which has more info: https://huggingface.co/blog/modernbert
Also worth a look: neoBERT https://huggingface.co/papers/2502.19587
E.g. Er macht das Fenster. vs Er macht das Fenster auf.
(He makes the window. vs He opens the window.)
No I don't think it makes any noticeable difference :)
Encoder-decoders are not in vogue.
Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval.
Decoders are favored for text generation, summarization and translation.
Recent research (see, eg, the Ettin paper: https://arxiv.org/html/2507.11412v1) seems to confirm the previous understanding that encoders are indeed better for “encoder tasks” and vice versa.
Fundamentally, both are transformers and so an encoder could be turned into a decoder or a decoder could be turned into an encoder.
The design difference comes down to bidirectional attention (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens).
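In code, that whole difference can be expressed as the attention mask (a toy 4-token example):

    import numpy as np

    seq_len = 4
    bidirectional = np.ones((seq_len, seq_len))             # encoder: every token attends to every token
    autoregressive = np.tril(np.ones((seq_len, seq_len)))   # decoder: token t attends to positions 0..t

    print(bidirectional)
    print(autoregressive)
    # Swapping one mask for the other (training objective aside) is essentially what
    # turning an encoder into a decoder, or vice versa, amounts to architecturally.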
A nice side effect of the diffusion mode is that its natural reliance on the bidirectional attention of the encoder layers provides much more flexible (and, critically, context-aware) understanding. As mentioned, later words can easily modulate earlier words, as with "bank [of the river]" / "bank [in the park]" / "bank [got robbed]", or the modern classic: telling an agent it did something wrong and expecting it to learn in-context from the mistake. (In practice, decoder-only models basically just get polluted by that, so you have to rewind the conversation, because the later correction has literally no way of backwards-affecting the problematic tokens.)
That said, the recent surge in training "reasoning" models to use thinking tokens that often get cut out of later conversation context, via a reinforcement-learning process that's not merely RLHF/preference conditioning, is actually quite related: discrete denoising diffusion models can be trained with an RL-style scheme during pretraining, where a training step is given the outcome goal and a masked version of it as the input query, and the model is then trained to manage the work done in the individual denoising steps on its own until it eventually produces the outcome goal, crucially without prescribing any order for filling in the masked tokens or how many to fill in at each step.
A recent paper on the matter: https://openreview.net/forum?id=MJNywBdSDy