As a tangent, what root data source are you using to calculate the movie embeddings?
Here's where I calculate cosine without numpy: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
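(For anyone curious, pure-Python cosine similarity is roughly this; a sketch, not the notebook's exact code:)

    import math

    def cosine_similarity(a, b):
        # dot product of the two vectors divided by the product of their magnitudes
        dot = sum(x * y for x, y in zip(a, b))
        mag_a = math.sqrt(sum(x * x for x in a))
        mag_b = math.sqrt(sum(x * x for x in b))
        return dot / (mag_a * mag_b)

    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0: same direction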
And in the distance notebook, I calculate with numpy: https://github.com/pamelafox/vector-embeddings-demos/blob/ma... I didn't use the @ operator! TIL.
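For reference, the @ version is just this (a sketch, not necessarily what the notebook does):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # @ is the dot product, so cosine similarity becomes a one-liner
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos)  # 1.0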
I forget where I originally got the Disney movie titles, but notably it is just the titles. A better ranking would be based on a movie synopsis as well. Here's where I calculated their embeddings using OpenAI: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
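Roughly, fetching embeddings with the OpenAI client looks like this (the model name and titles here are placeholders, not necessarily what the notebook uses):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    titles = ["The Lion King", "Frozen", "Moana"]
    response = client.embeddings.create(model="text-embedding-3-small", input=titles)
    embeddings = [item.embedding for item in response.data]
    print(len(embeddings), len(embeddings[0]))  # 3 vectors of 1536 dimensions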
Maybe I can submit a poster to PyTorch that would include the Python code as well.
Once you move up in dimensionality, things get really messy really fast. Variance contracts and the meaning of distance becomes much fuzzier: you can't differentiate your nearest neighbor from your furthest. Angles get much harder too, since nearly everything is nearly orthogonal to everything else, in most directions. I'm not all that surprised by "god" and "dog". I EXPECT them to be similar. After all, they are the reverse of one another. The question, rather, is: similar in which direction?
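You can see the nearest/furthest problem in a few lines of numpy (a rough illustration, with i.i.d. Gaussian points standing in for real embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        points = rng.normal(size=(1000, d))
        query = rng.normal(size=d)
        dists = np.linalg.norm(points - query, axis=1)
        # relative contrast: how much further the furthest neighbor is than the nearest
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:4d}  relative contrast={contrast:.3f}")  # shrinks as d grows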
There's no reason to believe you've measured along a direction that is humanly meaningful. It doesn't have to be semantics, and it doesn't have to be permutations either. Just like when you rotate your xy axes: moving along one new axis travels along both of the original directions.
So these things can really trick us. At the very least, be very careful not to become overly reliant upon them.
I think the other graphs I included aren't deceiving, they're just not quite as fun as an attempt to visualize the similarity space.
I don't think the other graphs are necessarily deceiving, but I think they don't capture as much information as we often imply, and that ends up leading people to make wrong assumptions about what is happening in the data.
Embeddings and immersions get really fucking weird at high dimensions. I mean, it gets weird around 4D and insane by 10D. The spaces we're talking about are incomprehensible. Every piece of geometric intuition you have should be thrown out the window; it won't help you, it will harm you. If you start digging into high-dimensional statistics and metric theory you'll quickly see what I'm talking about, like the craziness of Lp distances and the contraction of variance. You have to really dig into why we prefer L1 over L2 and why even fractional p's are of interest. We run into all kinds of problems with i.i.d. assumptions and all that. It is wild how many assumptions are being made that we generally don't even think about. They seem obvious and natural to use, but they don't work very well when D > 3. I do think the visualizations become useful again once you start getting used to all of this, but more in the sense that you interpret them with far less generalization in meaning.
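If you want a concrete taste of the Lp weirdness, here's a quick sketch (uniform random points standing in for data): lower p preserves noticeably more contrast between near and far neighbors, which is exactly why L1 and even fractional p get so much attention.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 500, 1000
    points = rng.random((n, d))
    query = rng.random(d)
    for p in [0.5, 1, 2]:
        # Lp "distance" (fractional p is not a true norm, but still a useful dissimilarity)
        dists = (np.abs(points - query) ** p).sum(axis=1) ** (1.0 / p)
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"p={p}: relative contrast={contrast:.3f}")  # drops as p increases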
I'm not trying to dunk on your post. I think it is fine. But I think our ML community needs to be having more conversations about these limits. We're really running into issues with them.
That isn't how tokenized inputs work. It's partially the same reason why "how many r's are in strawberry" is a hard problem for LLMs.
All these models are trained for semantic similarity by how they are actually used in relation to other words, so a data point where that doesn't follow intuitively is indeed weird.
It can get confusing because we usually roll tokenization and embedding up into a single process, but tokenization is the translation of our characters into numeric representations. There's self-discovery of what the atomic units should be (bounded by our vocabulary size).
The process is, at a high level: string -> integer -> vec<float>. You are learning the string splits, integer IDs, and vector embeddings. You are literally building a dictionary. The BPE paper is a good place to start[0], but it is far from the place we are now.
The embedding is this data in that latent representation space.
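A toy version of that pipeline, with a made-up vocabulary and a random embedding table (obviously not a real tokenizer):

    import numpy as np

    vocab = {"the": 0, "dog": 1, "god": 2, "ran": 3}  # string -> integer ID
    embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 8))  # ID -> vector

    def embed(text):
        ids = [vocab[tok] for tok in text.lower().split()]  # "tokenize" naively by whitespace
        return embedding_table[ids]  # look up the (normally learned) vectors

    print(embed("the dog ran").shape)  # (3, 8): three tokens, each an 8-dim vector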
> All these models are trained for semantic similarity
Citation needed... There's no real good measure of semantic similarity, so it would be really naive to assume that this must be happening. There is a natural pressure for it to occur, because words are generated in a biased way, but that's different from saying the models are trained to be semantically similar. There's even a bit of discussion about this in the Word2Vec paper[1], but you should also follow some of its citations to dig deeper.
You need to think VERY carefully about the vector basis[2]. You can very easily create an infinite number of bases that are isomorphic to the standard Cartesian one. We usually use [[1,0],[0,1]], but there's no reason you can't use some rotation like [[1/sqrt(2), -1/sqrt(2)],[1/sqrt(2), 1/sqrt(2)]]. Our (x,y) space is isomorphic to the new (u,v) space, but traveling along the u basis vector is not equivalent to traveling along the x basis vector (\hat{i}) or the y one (\hat{j}): you are traveling along both of them equally! u is still orthogonal to v and x is still orthogonal to y; it is just a rotation. We can also do something more complex, like using polar coordinates. All of this is equivalent: they all provide linearly independent unit vectors that span our space.
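A quick numpy check of that point (a sketch): rotate every vector into a new basis and all the distances and cosine similarities are untouched, even though the new axes mean nothing on their own.

    import numpy as np

    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])  # rotation = change of basis

    x = np.array([3.0, 1.0])
    y = np.array([1.0, 2.0])

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos_sim(x, y), cos_sim(R @ x, R @ y))                   # identical
    print(np.linalg.norm(x - y), np.linalg.norm(R @ x - R @ y))   # identical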
The point is, the semantics are a happy outcome, not a guaranteed or even specifically trained-for outcome. We should expect it to happen frequently because of how our languages evolved, but the "god"/"dog" example perfectly illustrates how this is naive.
You *CANNOT* train for semantic similarity until you *DEFINE* semantic similarity. That definition needs to be a strong, rigorous mathematical one, not an ad-hoc Justice Potter Stewart "I know it when I see it" kind of policy. The way words are used in relation to other words is definitely not well aligned with semantics. I can talk about cats and dogs or cats and potatoes all day long. The real similarity we'll come up with there is "nouns", and that's not much in the way of semantics. Even the examples I gave aren't strictly nouns. Shit gets real fucking messy real fast[3]. It's not just English; it happens in every language[4].
We can get WAY more into this, but no, sorry, that's not how this works.
[0] https://arxiv.org/abs/1508.07909
[1] https://arxiv.org/abs/1301.3781
[2] https://en.wikipedia.org/wiki/Basis_(linear_algebra)
[3] I'll leave you with my favorite example of linguistic ambiguity
Read rhymes with lead
and lead rhymes with read
but read doesn't rhyme with lead
and lead doesn't rhyme with read
[4] https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_...
https://www.amazon.com/Torre-Tagus-901918B-Spike-Sphere/dp/B...
A consequence of that is that many visualizations give people the wrong idea so I wouldn't try too hard.
Of everything in the article, I like the histograms of similarity the best, but they are in the weeds a lot with things like "god" ~ "dog". When I was building search engines I looked a lot at graphs that showed the similarity distribution of relevant vs irrelevant results.
I'll argue bitterly against word embeddings being "very good" for anything; actually, that similarity distribution looks pretty good, but my experience is that when you are looking at N words, word vectors look promising when N=5, yet when N>50 or so they break down completely. I've worked on teams that were considering both RNN and CNN models. My thinking was that if word embeddings had any knowledge in them that a deep model could benefit from, you could also train a classical ML model (say, some kind of SVM) to classify words on some characteristic like "is a color", "is a kind of person", or "can be used as a verb", but I could never get it to work.
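For the curious, the experiment I mean is roughly this (a sketch with gensim's downloadable GloVe vectors and tiny made-up word lists, not the actual code I used back then):

    import gensim.downloader as api
    import numpy as np
    from sklearn.svm import SVC

    wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

    colors = ["red", "green", "blue", "yellow", "purple", "pink", "brown"]
    non_colors = ["dog", "run", "table", "idea", "quickly", "france", "seven"]

    X = np.array([wv[w] for w in colors + non_colors])
    y = np.array([1] * len(colors) + [0] * len(non_colors))

    clf = SVC(kernel="rbf").fit(X, y)  # "is a color" classifier over word vectors
    for w in ["violet", "banana"]:
        print(w, clf.predict(wv[w].reshape(1, -1)))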
I went looking and never found that anyone had published positive or negative results for such a classifier. My feeling was it was a terrible tarpit: when N was tiny it would almost seem to work, but as N increased it would always fall apart. Between the bias against publishing negative results, the chance that people who got negative results blamed themselves rather than word embeddings, and the hype around word embeddings, those results didn't get published.
I do collect papers from arXiv where people do some boring text classification task, because I do boring text classification tasks, and I facepalm so often: people often try 15 or so algos, most of which never work well, and word embeddings are always in that category. If people tried some classical ML algos with both bag-of-words and pooled ModernBERT, they'd sample a good segment of the efficient frontier -- a BERT embedding doesn't just capture the word, it captures the meaning of the word in context, which is night and day for relevance, because matching the synonyms of all the different word senses brings in as many irrelevant matches as relevant ones, or more.
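The comparison I'm suggesting is something like this (a toy sketch; dataset, model name, and sizes are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sentence_transformers import SentenceTransformer

    texts = ["great movie", "terrible film", "loved it", "waste of time"]  # toy data
    labels = [1, 0, 1, 0]

    bow = TfidfVectorizer().fit_transform(texts)                 # bag-of-words features
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)  # pooled contextual embeddings

    for name, X in [("bag-of-words", bow), ("pooled embedding", emb)]:
        score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2).mean()
        print(name, score)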
> I like the histograms of similarity the best but they are in the weeds a lot with things like "god" ~ "dog".
I do like those too. But I think we have a tendency to misinterpret what direction we're moving in. The coordinate system is highly non-intuitive. It shouldn't be all that surprising that if we create an n-ball around "dog" we get things that are more semantically meaningful, like "cat" or "animal", but jesus christ, we're in over a thousand dimensions. It shouldn't be surprising that one of those directions is letter permutation. I can't even think of what a thousand meaningful directions would be!
Honestly, I think we should be more surprised that cosine similarity even works! Everything should be orthogonal. But clearly the manifold hypothesis is giving us a big leg up here, along with the semantic biases built into language.
People wildly underestimate how complex this topic is. It's baffling. It's mindblowing. And that's why it is so awesome and should excite people! I think we're doing a lot of footgunning by thinking this stuff is simple or solved. It is a wonderfully rich topic with so much left to discover.
Take a cube in N dimensions and pack N-dimensional spheres inside it. Then fit another sphere in the center so that it touches, but doesn't overlap with, any of the other spheres.
In 2D and 3D this is easy to visualize, and you can see that the sphere in the center is smaller than the other spheres, and of course smaller than the cube itself; after all, it's surrounded by the other spheres, which are by construction inside the cube.
At 10 dimensions and above, the central hypersphere actually extends beyond the hypercube, despite being surrounded by hyperspheres that are all contained inside the hypercube!
The math behind it is straightforward[0][1], but the implication is as counterintuitive as it gets.
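If you want to check the numbers, one standard way to set it up is unit spheres centered at (±1, ..., ±1) inside a cube of side 4; the central sphere that touches them has radius sqrt(N) - 1, which passes the cube's half-width of 2 once N hits 10:

    import math

    for n in [2, 3, 4, 9, 10, 100]:
        inner_radius = math.sqrt(n) - 1  # radius of the central sphere
        print(f"N={n:3d}  inner radius={inner_radius:.3f}  vs cube half-width=2")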
Or how Gaussian balls are like soap bubbles[2].
The latter of which is highly relevant to vector embeddings. Because if your distribution isn't uniform, the density of your mass isn't uniform, MEANING that if you linearly interpolate between two points in the space you are likely to get things that are not representative of your distribution. It happens because it is easy to confuse a straight line with a geodesic[3]. Like trying to draw a straight line between Los Angeles and Paris: you're going to be going through the dirt most of the time, which looks nothing like cities or even habitable land.
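A quick numerical sketch of the soap-bubble point: in high dimensions Gaussian samples sit near a shell of radius about sqrt(d), so the linear midpoint of two samples falls well inside the shell, i.e. somewhere the distribution almost never puts mass.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000
    a, b = rng.normal(size=d), rng.normal(size=d)
    midpoint = (a + b) / 2

    print(np.linalg.norm(a), np.linalg.norm(b))  # each ~ sqrt(1000) ≈ 31.6
    print(np.linalg.norm(midpoint))              # ~ sqrt(1000/2) ≈ 22.4, far off the shell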
I think the basic math is straightforward, but there's a lot of depth that is straight-up ignored in most of our discussions about this stuff. There's a lot of deep math here, and we really need to talk about the algebraic structures and topologies, and get deep into metric theory and set theory, to push forward in answering these questions. I think this belief that "the math is easy" is holding us back. I like to say "you don't need to know math to train good models, but you do need math to know why your models are wrong." (Obvious reference to "all models are wrong, but some are useful.") Especially in CS we have this tendency to oversimplify things, and it really is just arrogance that doesn't help us.
[0] https://davidegerosa.com/nsphere/
[1] https://en.wikipedia.org/wiki/Volume_of_an_n-ball
[2] https://www.inference.vc/high-dimensional-gaussian-distribut...
I like how it shows shadows and cross-sections from 2D->3D, and then from 3D->4D. Really captures the uncanny playfulness of it all.
I do have a notebook that does a PCA reduction to plot a similarity space: https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
But as I noted in another comment, I think it loses so much information as to be deceiving.
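(For the curious, the reduction is roughly this; a sketch with placeholder data, not the notebook's exact code:)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(20, 1536))    # placeholder for the real embeddings
    titles = [f"movie {i}" for i in range(20)]  # placeholder titles

    points = PCA(n_components=2).fit_transform(embeddings)  # 1536 dims -> 2 dims
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), title in zip(points, titles):
        plt.annotate(title, (x, y))
    plt.show()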
I also find this 3d visualization to be fun: https://projector.tensorflow.org/
But once again, huge loss in information.
I personally learn more by actually seeing the relative similarity ranking and scores within a dataset, versus trying to visualize all of the nodes on the same graph with a massive dimension simplification.
That 3d visualization is what originally intrigued me though, to see how else I could visualize. :)
(I never added text-embedding-3 to it)
https://aws.amazon.com/blogs/machine-learning/use-language-e...
The website is unfortunately down now, since I no longer work at Amazon, but the code is still readily available if you want to run it yourself.
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
One thing I learned recently is that, if your embedding model supports task types (clustering, STS, retrieval, etc.), then that can have a non-trivial impact on the generated embedding for a given text: https://technicalwriting.dev/ml/embeddings/tasks/index.html
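As one concrete example, I believe Google's client exposes a task_type parameter on its embedding call; treat the exact model and parameter names below as assumptions, since other providers spell this differently:

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key

    doc = genai.embed_content(model="models/text-embedding-004",
                              content="How do I reset my password?",
                              task_type="RETRIEVAL_DOCUMENT")
    query = genai.embed_content(model="models/text-embedding-004",
                                content="How do I reset my password?",
                                task_type="RETRIEVAL_QUERY")
    # Same text, different task types -> different vectors
    print(doc["embedding"][:3])
    print(query["embedding"][:3])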
Parquet and Polars sound very promising for reducing embeddings storage requirements. Still haven't tinkered with them: https://minimaxir.com/2025/02/embeddings-parquet/
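From skimming that post, the idea is roughly this (a sketch; column names and sizes are made up):

    import numpy as np
    import polars as pl

    rng = np.random.default_rng(0)
    df = pl.DataFrame({
        "title": [f"movie {i}" for i in range(100)],
        "embedding": rng.normal(size=(100, 256)).tolist(),  # list-of-floats column
    })
    df.write_parquet("embeddings.parquet")  # compact columnar storage

    loaded = pl.read_parquet("embeddings.parquet")
    matrix = np.array(loaded["embedding"].to_list())  # back to an (n, d) array
    print(matrix.shape)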
And this post gave me a lot more awareness to be more careful about how exactly I'm comparing embeddings. OP's post seems to do a good job explaining common techniques, too. https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity/
So the "input" is up to 8192 "units of measurement". What would that mean in practice? How are the units of measurement produced? Can they be anything?
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    emb = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = emb.encode(["dog", "god"])
    cosine_similarity(embeddings)
    # => array([[1.        , 0.41313702],
    #           [0.41313702, 1.0000004 ]], dtype=float32)
[1] https://tanelpoder.com/posts/comparing-vectors-of-the-same-r...