As a tangent, what root data source are you using to calculate the movie embeddings?
Here's where I calculate cosine without numpy: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
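(For anyone curious, pure-Python cosine similarity is roughly this; a sketch, not the notebook's exact code:)

    import math

    def cosine_similarity(a, b):
        # dot product of the two vectors divided by the product of their magnitudes
        dot = sum(x * y for x, y in zip(a, b))
        mag_a = math.sqrt(sum(x * x for x in a))
        mag_b = math.sqrt(sum(x * x for x in b))
        return dot / (mag_a * mag_b)

    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0: same direction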
And in the distance notebook, I calculate with numpy: https://github.com/pamelafox/vector-embeddings-demos/blob/ma... I didn't use the @ operator! TIL.
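For reference, the @ version is just this (a sketch, not necessarily what the notebook does):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # @ is the dot product, so cosine similarity becomes a one-liner
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos)  # 1.0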
I forget where I originally got the Disney movie titles, but notably it is just the titles. A better ranking would be based on a movie synopsis as well. Here's where I calculated their embeddings using OpenAI: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
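Roughly, fetching embeddings with the OpenAI client looks like this (the model name and titles here are placeholders, not necessarily what the notebook uses):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    titles = ["The Lion King", "Frozen", "Moana"]
    response = client.embeddings.create(model="text-embedding-3-small", input=titles)
    embeddings = [item.embedding for item in response.data]
    print(len(embeddings), len(embeddings[0]))  # 3 vectors of 1536 dimensions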
Maybe I can submit a poster to PyTorch that would include the Python code as well.
Once you move up in dimensionality, things get really messy really fast. Variance contracts and the meaning of distance becomes much fuzzier: you can't differentiate your nearest neighbor from your furthest. Angles get much harder too, since nearly everything is nearly orthogonal to everything else, in most directions. I'm not all that surprised by "god" and "dog". I EXPECT them to be similar. After all, they are the reverse of one another. The question, rather, is: similar in which direction?
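You can see the nearest/furthest problem in a few lines of numpy (a rough illustration, with i.i.d. Gaussian points standing in for real embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        points = rng.normal(size=(1000, d))
        query = rng.normal(size=d)
        dists = np.linalg.norm(points - query, axis=1)
        # relative contrast: how much further the furthest neighbor is than the nearest
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:4d}  relative contrast={contrast:.3f}")  # shrinks as d grows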
There's no reason to believe you've measured along a direction that is humanly meaningful. It doesn't have to be semantics, and it doesn't have to be permutations either. Just like when you rotate your xy axes: moving along one new axis travels along both of the original directions.
So these things can really trick us. At the very least, be very careful not to become overly reliant upon them.
I think the other graphs I included aren't deceiving, they're just not quite as fun as an attempt to visualize the similarity space.
I don't think the other graphs are necessarily deceiving, but I think they don't capture as much information as we often imply, and that ends up leading people to make wrong assumptions about what is happening in the data.
Embeddings and immersions get really fucking weird at high dimensions. I mean, it gets weird around 4D and insane by 10D. The spaces we're talking about are incomprehensible. Every piece of geometric intuition you have should be thrown out the window; it won't help you, it will harm you. If you start digging into high-dimensional statistics and metric theory you'll quickly see what I'm talking about, like the craziness of Lp distances and the contraction of variance. You have to really dig into why we prefer L1 over L2 and why even fractional p's are of interest. We run into all kinds of problems with i.i.d. assumptions and all that. It is wild how many assumptions are being made that we generally don't even think about. They seem obvious and natural to use, but they don't work very well when D > 3. I do think the visualizations become useful again once you start getting used to all of this, but more in the sense that you interpret them with far less generalization in meaning.
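If you want a concrete taste of the Lp weirdness, here's a quick sketch (uniform random points standing in for data): lower p preserves noticeably more contrast between near and far neighbors, which is exactly why L1 and even fractional p get so much attention.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 500, 1000
    points = rng.random((n, d))
    query = rng.random(d)
    for p in [0.5, 1, 2]:
        # Lp "distance" (fractional p is not a true norm, but still a useful dissimilarity)
        dists = (np.abs(points - query) ** p).sum(axis=1) ** (1.0 / p)
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"p={p}: relative contrast={contrast:.3f}")  # drops as p increases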
I'm not trying to dunk on your post. I think it is fine. But I think our ML community needs to be having more conversations about these limits. We're really running into issues with them.
That isn't how tokenized inputs work. It's partially the same reason why "how many r's are in strawberry" is a hard problem for LLMs.
All these models are trained for semantic similarity by how they are actually used in relation to other words, so a data point where that doesn't follow intuitively is indeed weird.
It can get confusing because we usually roll tokenization and embedding up into a single process, but tokenization is the translation of our characters into numeric representations. There's self-discovery of what the atomic units should be (bounded by our vocabulary size).
The process is, at a high level: string -> integer -> vec<float>. You are learning the string splits, integer IDs, and vector embeddings. You are literally building a dictionary. The BPE paper is a good place to start[0], but it is far from the place we are now.
The embedding is this data in that latent representation space.
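A toy version of that pipeline, with a made-up vocabulary and a random embedding table (obviously not a real tokenizer):

    import numpy as np

    vocab = {"the": 0, "dog": 1, "god": 2, "ran": 3}  # string -> integer ID
    embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 8))  # ID -> vector

    def embed(text):
        ids = [vocab[tok] for tok in text.lower().split()]  # "tokenize" naively by whitespace
        return embedding_table[ids]  # look up the (normally learned) vectors

    print(embed("the dog ran").shape)  # (3, 8): three tokens, each an 8-dim vector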
> All these models are trained for semantic similarity
Citation needed... There's no real good measure of semantic similarity, so it would be really naive to assume that this must be happening. There is a natural pressure for it to occur, because words are generated in a biased way, but that's different from saying the models are trained to be semantically similar. There's even a bit of discussion about this in the Word2Vec paper[1], but you should also follow some of its citations to dig deeper.
You need to think VERY carefully about the vector basis[2]. You can very easily create an infinite number of bases that are isomorphic to the standard Cartesian one. We usually use [[1,0],[0,1]], but there's no reason you can't use some rotation like [[1/sqrt(2), -1/sqrt(2)],[1/sqrt(2), 1/sqrt(2)]]. Our (x,y) space is isomorphic to the new (u,v) space, but traveling along the u basis vector is not equivalent to traveling along the x basis vector (\hat{i}) or the y one (\hat{j}): you are traveling along both of them equally! u is still orthogonal to v and x is still orthogonal to y; it is just a rotation. We can also do something more complex, like using polar coordinates. All of this is equivalent: they all provide linearly independent unit vectors that span our space.
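A quick numpy check of that point (a sketch): rotate every vector into a new basis and all the distances and cosine similarities are untouched, even though the new axes mean nothing on their own.

    import numpy as np

    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])  # rotation = change of basis

    x = np.array([3.0, 1.0])
    y = np.array([1.0, 2.0])

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos_sim(x, y), cos_sim(R @ x, R @ y))                   # identical
    print(np.linalg.norm(x - y), np.linalg.norm(R @ x - R @ y))   # identical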
The point is, the semantics are a happy outcome, not a guaranteed or even specifically trained-for outcome. We should expect it to happen frequently because of how our languages evolved, but the "god"/"dog" example perfectly illustrates how this is naive.
You *CANNOT* train for semantic similarity until you *DEFINE* semantic similarity. That definition needs to be a strong, rigorous mathematical one, not an ad-hoc Justice Potter Stewart "I know it when I see it" kind of policy. The way words are used in relation to other words is definitely not well aligned with semantics. I can talk about cats and dogs or cats and potatoes all day long. The real similarity we'll come up with there is "nouns", and that's not much in the way of semantics. Even the examples I gave aren't strictly nouns. Shit gets real fucking messy real fast[3]. It's not just English; it happens in every language[4].
We can get WAY more into this, but no, sorry, that's not how this works.
[0] https://arxiv.org/abs/1508.07909
[1] https://arxiv.org/abs/1301.3781
[2] https://en.wikipedia.org/wiki/Basis_(linear_algebra)
[3] I'll leave you with my favorite example of linguistic ambiguity
Read rhymes with lead
and lead rhymes with read
but read doesn't rhyme with lead
and lead doesn't rhyme with read
[4] https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_...
https://www.amazon.com/Torre-Tagus-901918B-Spike-Sphere/dp/B...
A consequence of that is that many visualizations give people the wrong idea so I wouldn't try too hard.
Of everything in the article, I like the histograms of similarity the best, but they are in the weeds a lot with things like "god" ~ "dog". When I was building search engines I looked a lot at graphs that showed the similarity distribution of relevant vs irrelevant results.
I'll argue bitterly against word embeddings being "very good" for anything; actually, that similarity distribution looks pretty good, but my experience is that when you are looking at N words, word vectors look promising when N=5, yet when N>50 or so they break down completely. I've worked on teams that were considering both RNN and CNN models. My thinking was that if word embeddings had any knowledge in them that a deep model could benefit from, you could also train a classical ML model (say, some kind of SVM) to classify words on some characteristic like "is a color", "is a kind of person", or "can be used as a verb", but I could never get it to work.
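For the curious, the experiment I mean is roughly this (a sketch with gensim's downloadable GloVe vectors and tiny made-up word lists, not the actual code I used back then):

    import gensim.downloader as api
    import numpy as np
    from sklearn.svm import SVC

    wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

    colors = ["red", "green", "blue", "yellow", "purple", "pink", "brown"]
    non_colors = ["dog", "run", "table", "idea", "quickly", "france", "seven"]

    X = np.array([wv[w] for w in colors + non_colors])
    y = np.array([1] * len(colors) + [0] * len(non_colors))

    clf = SVC(kernel="rbf").fit(X, y)  # "is a color" classifier over word vectors
    for w in ["violet", "banana"]:
        print(w, clf.predict(wv[w].reshape(1, -1)))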
I went looking and never found that anyone had published positive or negative results for such a classifier. My feeling was it was a terrible tarpit: when N was tiny it would almost seem to work, but as N increased it would always fall apart. Between the bias against publishing negative results, the chance that people who got negative results blamed themselves rather than word embeddings, and the hype around word embeddings, those results didn't get published.
I do collect papers from arXiv where people do some boring text classification task, because I do boring text classification tasks, and I facepalm so often: people often try 15 or so algos, most of which never work well, and word embeddings are always in that category. If people tried some classical ML algos with both bag-of-words and pooled ModernBERT, they'd sample a good segment of the efficient frontier -- a BERT embedding doesn't just capture the word, it captures the meaning of the word in context, which is night and day for relevance, because matching the synonyms of all the different word senses brings in as many irrelevant matches as relevant ones, or more.
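The comparison I'm suggesting is something like this (a toy sketch; dataset, model name, and sizes are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sentence_transformers import SentenceTransformer

    texts = ["great movie", "terrible film", "loved it", "waste of time"]  # toy data
    labels = [1, 0, 1, 0]

    bow = TfidfVectorizer().fit_transform(texts)                 # bag-of-words features
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)  # pooled contextual embeddings

    for name, X in [("bag-of-words", bow), ("pooled embedding", emb)]:
        score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2).mean()
        print(name, score)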
> I like the histograms of similarity the best but they are in the weeds a lot with things like "god" ~ "dog".
I do like those too. But I think we have a tendency to misinterpret what direction we're moving in. The coordinate system is highly non-intuitive. It shouldn't be all that surprising that if we create an n-ball around "dog" we get things that are more semantically meaningful, like "cat" or "animal", but jesus christ, we're in over a thousand dimensions. It shouldn't be surprising that one of those directions is letter permutation. I can't even think of what a thousand meaningful directions would be!
Honestly, I think we should be more surprised that cosine similarity even works! Everything should be orthogonal. But clearly the manifold hypothesis is giving us a big leg up here, along with the semantic biases built into language.
People wildly underestimate how complex this topic is. It's baffling. It's mindblowing. And that's why it is so awesome and should excite people! I think we're doing a lot of footgunning by thinking this stuff is simple or solved. It is a wonderfully rich topic with so much left to discover.
Take a cube in N dimensions and pack N-dimensional spheres inside it. Then fit another sphere in the center so that it touches, but doesn't overlap with, any of the other spheres.
In 2D and 3D this is easy to visualize, and you can see that the sphere in the center is smaller than the other spheres, and of course smaller than the cube itself; after all, it's surrounded by the other spheres, which are by construction inside the cube.
At 10 dimensions and above, the central hypersphere actually extends beyond the hypercube, despite being surrounded by hyperspheres that are all contained inside the hypercube!
The math behind it is straightforward[0][1], but the implication is as counterintuitive as it gets.
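If you want to check the numbers, one standard way to set it up is unit spheres centered at (±1, ..., ±1) inside a cube of side 4; the central sphere that touches them has radius sqrt(N) - 1, which passes the cube's half-width of 2 once N hits 10:

    import math

    for n in [2, 3, 4, 9, 10, 100]:
        inner_radius = math.sqrt(n) - 1  # radius of the central sphere
        print(f"N={n:3d}  inner radius={inner_radius:.3f}  vs cube half-width=2")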
Or how Gaussian balls are like soap bubbles[2].
The latter of which is highly relevant to vector embeddings. Because if your distribution isn't uniform, the density of your mass isn't uniform, MEANING that if you linearly interpolate between two points in the space you are likely to get things that are not representative of your distribution. It happens because it is easy to confuse a straight line with a geodesic[3]. Like trying to draw a straight line between Los Angeles and Paris: you're going to be going through the dirt most of the time, which looks nothing like cities or even habitable land.
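A quick numerical sketch of the soap-bubble point: in high dimensions Gaussian samples sit near a shell of radius about sqrt(d), so the linear midpoint of two samples falls well inside the shell, i.e. somewhere the distribution almost never puts mass.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000
    a, b = rng.normal(size=d), rng.normal(size=d)
    midpoint = (a + b) / 2

    print(np.linalg.norm(a), np.linalg.norm(b))  # each ~ sqrt(1000) ≈ 31.6
    print(np.linalg.norm(midpoint))              # ~ sqrt(1000/2) ≈ 22.4, far off the shell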
I think the basic math is straightforward, but there's a lot of depth that is straight-up ignored in most of our discussions about this stuff. There's a lot of deep math here, and we really need to talk about the algebraic structures and topologies, and get deep into metric theory and set theory, to push forward in answering these questions. I think this belief that "the math is easy" is holding us back. I like to say "you don't need to know math to train good models, but you do need math to know why your models are wrong." (Obvious reference to "all models are wrong, but some are useful.") Especially in CS we have this tendency to oversimplify things, and it really is just arrogance that doesn't help us.
[0] https://davidegerosa.com/nsphere/
[1] https://en.wikipedia.org/wiki/Volume_of_an_n-ball
[2] https://www.inference.vc/high-dimensional-gaussian-distribut...
I like how it shows shadows and cross-sections from 2D->3D, and then from 3D->4D. Really captures the uncanny playfulness of it all.
I do have a notebook that does a PCA reduction to plot a similarity space: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
But as I noted in another comment, I think it loses so much information as to be deceiving.
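(For the curious, the reduction is roughly this; a sketch with placeholder data, not the notebook's exact code:)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(20, 1536))    # placeholder for the real embeddings
    titles = [f"movie {i}" for i in range(20)]  # placeholder titles

    points = PCA(n_components=2).fit_transform(embeddings)  # 1536 dims -> 2 dims
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), title in zip(points, titles):
        plt.annotate(title, (x, y))
    plt.show()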
I also find this 3d visualization to be fun: https://projector.tensorflow.org/
But once again, huge loss in information.
I personally learn more by actually seeing the relative similarity ranking and scores within a dataset, versus trying to visualize all of the nodes on the same graph with a massive dimension simplification.
That 3d visualization is what originally intrigued me though, to see how else I could visualize. :)
(I never added text-embedding-3 to it)
https://aws.amazon.com/blogs/machine-learning/use-language-e...
The website is unfortunately down now, since I no longer work at Amazon, but the code is still readily available if you want to run it yourself.
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
One thing I learned recently is that, if your embedding model supports task types (clustering, STS, retrieval, etc.), then that can have a non-trivial impact on the generated embedding for a given text: https://technicalwriting.dev/ml/embeddings/tasks/index.html
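As one concrete example, I believe Google's client exposes a task_type parameter on its embedding call; treat the exact model and parameter names below as assumptions, since other providers spell this differently:

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key

    doc = genai.embed_content(model="models/text-embedding-004",
                              content="How do I reset my password?",
                              task_type="RETRIEVAL_DOCUMENT")
    query = genai.embed_content(model="models/text-embedding-004",
                                content="How do I reset my password?",
                                task_type="RETRIEVAL_QUERY")
    # Same text, different task types -> different vectors
    print(doc["embedding"][:3])
    print(query["embedding"][:3])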
Parquet and Polars sound very promising for reducing embeddings storage requirements. Still haven't tinkered with them: https://minimaxir.com/2025/02/embeddings-parquet/
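From skimming that post, the idea is roughly this (a sketch; column names and sizes are made up):

    import numpy as np
    import polars as pl

    rng = np.random.default_rng(0)
    df = pl.DataFrame({
        "title": [f"movie {i}" for i in range(100)],
        "embedding": rng.normal(size=(100, 256)).tolist(),  # list-of-floats column
    })
    df.write_parquet("embeddings.parquet")  # compact columnar storage

    loaded = pl.read_parquet("embeddings.parquet")
    matrix = np.array(loaded["embedding"].to_list())  # back to an (n, d) array
    print(matrix.shape)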
And this post gave me a lot more awareness to be more careful about how exactly I'm comparing embeddings. OP's post seems to do a good job explaining common techniques, too. https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity/
So the "input" is up to 8192 "units of measurement". What would that mean in practice? How are the units of measurement produced? Can they be anything?
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    emb = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = emb.encode(["dog", "god"])
    cosine_similarity(embeddings)
    # => array([[1.        , 0.41313702],
    #           [0.41313702, 1.0000004 ]], dtype=float32)
[1] https://tanelpoder.com/posts/comparing-vectors-of-the-same-r...