(1) Sometimes your query is a short document. Say you wanted to know if there were any patents similar to something you invented. You'd give a professional patent searcher a paragraph or a few paragraphs describing the invention, and you can give a "semantic search engine" that same paragraph -- I helped build one that did about as well as the professional, using embeddings before this was cool.
(2) Even Salton's early works on IR talked about "relevance feedback", where you'd mark some documents in your results as relevant and some as irrelevant. With bag-of-words this doesn't really work well (it can take 1000 samples for a bag-of-words classifier to "wake up") but it works much better with embeddings.
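A minimal sketch of what that feedback loop looks like on top of embeddings (classic Rocchio-style update; the weights here are just conventional defaults, not anything from the engine I mentioned):

```python
import numpy as np

def refine_query(query_vec, relevant_vecs, irrelevant_vecs,
                 alpha=1.0, beta=0.75, gamma=0.15):
    # Nudge the query embedding toward the documents marked relevant
    # and away from the ones marked irrelevant, then re-normalize.
    q = (alpha * np.asarray(query_vec)
         + beta * np.mean(relevant_vecs, axis=0)
         - gamma * np.mean(irrelevant_vecs, axis=0))
    return q / np.linalg.norm(q)
```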
The thing is that embeddings are "hunchy" and not really the right data structure to represent things like "people who are between 5 feet and 6 feet tall and have been on more than 1000 airplane flights in their life" (knowledge graph/database sorts of queries) or "the thread that links the work of Derrida and Badiou" (which could be spelled out logically in some particular framework, but doing that in general seems practically intractable).
search is an active "I'm looking for X"
related articles is a passive "hey thanks for reading this article, you might also like Y"
The post was previously discussed 6 months ago: https://news.ycombinator.com/item?id=42013762
To be clear, when I said "embeddings are underrated" I was only arguing that my fellow technical writers (TWs) were not paying enough attention to a very useful new tool in the TW toolbox. I know that the statement sounds silly to ML practitioners, who very much don't "underrate" embeddings.
I know that the post is light on details regarding how exactly we apply embeddings in TW. I have some projects and other blog posts in the pipeline. Short story long, embeddings are important because they can help us make progress on the 3 intractable challenges of TW: https://technicalwriting.dev/strategy/challenges.html
I’m curious how you found the quality of the results? This gets into evals which ML folks love, but even just with “vibes” do the results eyeball as reasonable to you?
More significantly, after having read the first 6 or 8 paragraphs, I still have no clue what an "embedding" is. From the 3rd paragraph:
> Here’s an overview of how you use embeddings and how they work.
But no mention of what they are (unless perhaps it's buried far deeper in the article).
Very little maths and lots of dogs involved.
https://aws.amazon.com/blogs/machine-learning/use-language-e...
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
I really enjoy working with embeddings. They’re truly fascinating as a representation of meaning - but also a cheap and effective way to perform things like categorisation and clustering.
A generic embedding model does not have enough specificity to cluster the specialized terms or "code names" of specific entities (these differ across orgs but represent the same sets of concepts within the domain). A more specific model cannot be trained because the data is not available.
Quite the conundrum!
nit. This suggests that the model contains a direction with some notion of gender, not a dimension. Direction and dimension appear to be inextricably linked by definition, but with some handwavy maths, you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n. This helps explain why spaces on the order of 1k dimensions can "fit" billions of concepts.
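A quick numpy check of that near-orthogonality claim (500 random unit directions per dimensionality; the typical off-diagonal dot product shrinks roughly like 1/sqrt(n)):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    V = rng.normal(size=(500, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # 500 random unit directions
    dots = V @ V.T
    off_diag = np.abs(dots[~np.eye(500, dtype=bool)])
    print(n, off_diag.mean())                           # average |cosine| between pairs
```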
I think your comment is also clicking for me now because I previously did not really understand how cosine similarity worked, but then watched videos like this and understand it better now: https://youtu.be/e9U0QAFbfLI
I will eventually update the post to correct this inaccuracy, thank you for improving my own wetware's conceptual model of embeddings
So the distinction between a direction and a dimension expressing 'gender' is that maybe gender isn't 'important' (or I guess high-information-density) enough to be an entire dimension, but rather is expressed by a linear combination of two (or more) yet more abstract dimensions.
This is maybe showing some age as well, or maybe not. It seems that text generation will soon be writing top-tier technical docs - the research done on the problem of sycophancy will likely result in something significantly better than what LLMs had before the regression to sycophancy. Either way, I take "having the biggest impact on technical writing" to mean in the near term. If having great search and organization tools (ambient findability and such) is going to steal the thunder from LLMs writing really good technical docs, it's going to need to happen fast.
It’s the first in a series of three that I can very highly recommend.
> there's no way that there's a 1-to-1 correspondence between concepts and dimensions.
I don’t know about that! Once you go very high dimensional, there are a lot of direction vectors that are almost perfectly perpendicular to each other (meaning they can cleanly encode a trait). Maybe they don’t even need to be perfectly perpendicular; the dot product just needs to be very close to zero.
nit within a nit: I believe you intended to write "nearly orthogonal directions within n dimensional space" which is important as you are distinguishing direction from dimension in your post.
In
https://nlp.stanford.edu/projects/glove/
there are a number of graphs where they have about N=20 points that seem to fall in "the right place", but there are a lot of dimensions involved, and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall. If you try experiments with N>100 words you go endlessly in circles and produce the kind of inconclusively negative results that people don't publish.
The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word. For instance, you can't really build a "part of speech" classifier that can tell you "red" is an adjective, because it is also a noun -- but give it the context and you can.
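A hedged sketch of that context effect with an off-the-shelf BERT (assumes the `transformers` and `torch` packages; the sentences are just illustrations):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    # Contextual vector of `word` (must be a single vocabulary token) in `sentence`.
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    ids = enc["input_ids"][0].tolist()
    return hidden[ids.index(tok.convert_tokens_to_ids(word))]

adjective_use = token_vector("she wore a red dress", "red")
noun_use = token_vector("red is my favorite color", "red")
print(torch.cosine_similarity(adjective_use, noun_use, dim=0))  # noticeably below 1.0
```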
In the context of full text search, bringing in synonyms is a mixed bag because a word might have 2 or 3 meanings and the irrelevant synonyms are... irrelevant and will bring in irrelevant documents. Modern embeddings that recognize context not only bring in synonyms but they will suppress usages of the word with different meanings, something the IR community has tried to figure out for about 50 years.
In addition to being able to utilize attention mechanisms, modern embedding models use a form of tokenization such as BPE which a) includes punctuation, which is incredibly important for extracting semantic meaning, and b) includes case, without as large a memory requirement as a cased model.
The original BERT used an uncased WordPiece tokenizer, which is out of date nowadays.
Little did I know that people were going to have a lot of tolerance for "short circuiting" of LLMs, that is getting the right answer by the wrong path, so I'd say now that my methodology of "predictive evaluation" that would put an upper bound on what a system could do was pessimistic. Still I don't like giving credit for "right answer by wrong means" since you can't count on it.
Ramsey theory (or 'the Woolworths store alignment hypothesis')
While it would certainly have been possible to choose a projection where the two groups of words are linearly separable, that isn't even the case for https://nlp.stanford.edu/projects/glove/images/man_woman.jpg : "woman" is inside the "nephew"-"man"-"earl" triangle, so there is no way to draw a line neatly dividing the masculine from the feminine words. But I think the graph wasn't intended to show individual words classified by gender, but rather to demonstrate that in pairs of related words, the difference between the feminine and masculine word vectors points in a consistent direction.
Of course that is hardly useful for anything (if you could compare unrelated words, at least you would've been able to use it to sort lists...) but I don't think the GloVe authors can be accused of having created unrealistic graphs when their graph actually very realistically shows a situation where the kind of simple linear classifier that people would've wanted doesn't exist.
This is missing the point. What we have is two dimensions* out of hundreds, but the two dimensions chosen show that the vector between a masculine word and its feminine counterpart is very nearly constant, at least across these words and excluding other dimensions.
What you're saying, a line/plane/hyper-plane that separates a dimension of gender into male and female, might also exist. But since gender-neutral terms also exist, we would expect that to be a plane at which gender-neutral terms have a 50/50 chance of falling to either side, and ideally fall near the plane itself.
* Possibly a pseudo dimension that's a composite of multiple dimensions; IDK, I didn't read the paper.
For large datasets (UMAP's compute cost grows steeply with dataset size), you will need to use the GPU-accelerated UMAP from cuML. https://docs.rapids.ai/api/cuml/stable/api/#umap
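A minimal sketch of that kind of reduction with the CPU umap-learn package (the matrix here is random stand-in data; cuML exposes a GPU UMAP with a very similar interface):

```python
import numpy as np
import umap

X = np.random.rand(5000, 384).astype(np.float32)   # stand-in for real embeddings
reducer = umap.UMAP(n_neighbors=15, n_components=2, metric="cosine")
coords = reducer.fit_transform(X)                   # (5000, 2) points ready to plot
print(coords.shape)
```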
You can test this hypothesis with some clever LLM prompting. When I did this I got "male monarch" for "king" but "British ruler" for "queen".
Oops!
>> nit. This suggests that the model contains a direction with some notion of gender ...
In fact, it is likely even more restrictive ...
Even if the said vector arithmetic were to be (approximately) honored by the gender-specific words, it only means there's a specific vector (with a specific direction and magnitude) for such a gender translation. 'woman' + ('king' - 'man') goes to 'queen'; however, p * ('king' - 'man') with p significantly different from one may be a different relation altogether.
The meaning of the vector 'king' - 'man' may be further restricted in that the same vector added to 'queen' need not land on some still more royal version of a queen! The networks can learn non-linear behaviors, so the meaning of the vector could depend on something about the starting position too.
... unless shown otherwise via experimental data or some reasoning.
I always wondered: if we want to preserve distances between a billion points within 10%, that would mean we need ~18k dimensions. 1% would be 1.8m. Is there a stronger version of the lemma for points that are well spread out? Or are embeddings really just fine with low precision for the distance?
[1] https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
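For reference, scikit-learn ships a helper that evaluates the standard Johnson-Lindenstrauss bound, and it reproduces roughly those figures (it doesn't answer the well-spread-points question, just the generic bound):

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# eps is the allowed relative distortion of pairwise distances
print(johnson_lindenstrauss_min_dim(n_samples=1_000_000_000, eps=0.10))  # ~18k
print(johnson_lindenstrauss_min_dim(n_samples=1_000_000_000, eps=0.01))  # ~1.7M
```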
nit for the nit (micro nit!): Is it meant to be "a number of nearly orthogonal directions within n dimensional space"? Otherwise n dimensional space will have just n dimensions.
I had a non-traditional use case recently, as well. I wanted to debounce the API calls I'm making to gemini flash as the user types his instructions, and I decided to try a very lightweight embeddings model, light enough to run on CPU and way too underpowered to attempt vector search with. It works pretty well! https://brokk.ai/blog/brokk-under-the-hood
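Roughly the shape of that debounce, sketched with sentence-transformers (the model, the threshold, and the call_gemini_flash helper are all illustrative -- not necessarily how Brokk actually does it):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # small enough to run on CPU
last_vec = None

def maybe_call_llm(draft_instruction, threshold=0.97):
    # Skip the expensive API call while the instruction's meaning hasn't drifted.
    global last_vec
    vec = model.encode(draft_instruction, convert_to_tensor=True)
    if last_vec is not None and util.cos_sim(vec, last_vec).item() > threshold:
        return None
    last_vec = vec
    return call_gemini_flash(draft_instruction)    # hypothetical downstream call
```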
An embedding is generated after a single pass through the model, so functionally it's the equivalent of generating a single token from a text generation model.
Compare to ModernBERT, which uses more modern techniques and is still bidirectional, but it is very very speedy. https://huggingface.co/blog/modernbert
ONNX models can be loaded and executed with transformers.js https://github.com/huggingface/transformers.js/
You can even build and statically host indices like hnsw for embeddings.
I put together a little open source demo for this here https://jasonjmcghee.github.io/portable-hnsw/ (it's a prototype / hacked together approximation of hnsw, but you could implement the real thing)
Long story short, represent indices as queryable parquet files and use duckdb to query them.
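A rough Python sketch of the same idea (the file name and column names are made up for illustration):

```python
import duckdb
import numpy as np

# Pull the whole (static, pre-computed) index out of a parquet file.
rows = duckdb.sql("SELECT id, text, embedding FROM 'embeddings.parquet'").fetchall()
ids, texts, vecs = zip(*rows)
X = np.array(vecs, dtype=np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)       # normalize once up front

def top_k(query_vec, k=5):
    q = np.asarray(query_vec, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = X @ q                                   # cosine similarity per row
    best = np.argsort(-scores)[:k]
    return [(ids[i], texts[i], float(scores[i])) for i in best]
```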
Depending on how you host, it's either free or nearly free. I used Github Pages so it's free. R2 with cloudflare would only cost the size what you store (very cheap- no egress fees).
To render locally, you need access to the model right? I just wonder how good those embeddings will be compared to those from OpenAI/Google/etc in terms of semantic search. I do like the free/instant aspect though
I've had particularly good experiences with nomic, bge, gte, and all-MiniLM-L6-v2. All are hundreds of MB (except all-minilm which is like 87MB)
Parquet and Polars are definitely on my radar, though, after reading this: https://minimaxir.com/2025/02/embeddings-parquet/
rhizome search --limit 2 "pull apart"
Model already exists at ~/.rhizome/models/bge-small-en-v1.5.onnx
~/Projects/rhizome/src/chunking.rs:458:fn rust_no_structural_items_fallback() {
~/Projects/rhizome/src/lib.rs:2:pub mod chunking;
I think there needs to be some more clarification here. Hash functions also return the same sized output no matter how big or small the input text. However, mathematically comparing two hashes is going to have a much different meaning than mathematically comparing two embeddings.
I'd recommend emphasizing that embeddings are training dependent--the quality of comparison will depend on the quality and type of training used to produce the embedding. There isn't some single "universal embedding" that allows for meaningful comparison of arbitrary text.
[1] https://arxiv.org/abs/2403.20327
Wow, that's bold. I guess "good" technical writing no longer includes a thesis statement.
Seriously though, why would this be useful for technical writing? Sure, you could make some similar-pages widget, but I don't think I've ever wanted that when reading technical docs, let alone writing them.
Embeddings are a _very_ useful tool for building better search - they can handle "fuzzy" matches, where a user can say things like "that feature that lets me run a function against every column of data" because they can't remember the name of the feature.
With embeddings you can implement a hybrid approach, where you mix both keyword search (still necessary because embeddings can miss things that use jargon they weren't trained on) and vector similarity search.
I wish I had good examples to point to for this!
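For what it's worth, here's a hedged sketch of the hybrid idea using reciprocal rank fusion (the two ranked lists stand in for a real keyword ranker and a real embedding ranker):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, best match first
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc7", "doc2", "doc9"]   # e.g. from BM25 / full-text search
vector_hits = ["doc2", "doc5", "doc7"]    # e.g. from embedding similarity
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc2 and doc7 float to the top
```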
One of the things I love about Sphinx is that it has a decent, client-side, JS-powered offline search. I recently hacked together a workflow for making it search-as-you-type [1]. jasonjmcghee's comment [2] has got me pondering whether we can augment it with transformers.js embeddings.
Thesis is outlined in the second paragraph:
> What embeddings offer to technical writers is the ability to discover connections between texts at previously impossible scales.
I think it's fair, however, to say that this post is ineffective because it does not provide concrete examples of the thesis in action. My only excuse is that I never intended for this to be a standalone post but life got in the way (in the best possible way!) https://news.ycombinator.com/item?id=43964584
> why would this be useful for technical writing?
You're not going to like this answer, because it's also vague. There are 3 intractable challenges in technical writing. Embeddings can help us make progress on all 3: https://technicalwriting.dev/strategy/challenges.html
I could imagine an LLM inference pipeline where the next token ranking includes its similarity to the target embedding, or perhaps instead the change in direction towards/away from the desired embedding that adding it would introduce.
Put another way, the author gives the example:
> embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
What if you could do that but for whole bodies of text?
I'm imagining being able to do "semantic algebra" with whole paragraphs/articles/books. Instead of just prompting an LLM to "adjust the tone to be more friendly", you could have the core concept of "friendly" (or some more nuanced variant thereof) and "add" it to your existing text, etc.
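The word-level version at least already works with off-the-shelf tools; a hedged sketch with gensim's pretrained GloVe vectors (doing the same thing with whole paragraphs is the open question):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # ~128 MB download on first run
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically shows up at or near the top of the list
```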
Embeddings are roughly the equivalent of fuzzy hashes.
Embeddings are a way of mapping a data array to a different (and yes smaller) data array, but the goal is not to compress into one thing, but to spread out into an array of output, where each element of the output has meaning. Embeddings are the exact opposite of hashes.
Hashes destroy meaning. Embeddings create meaning. Hashes destroy structure in space. Embeddings create structures in space.
You’re probably thinking of cryptographic hashes, where avoiding collisions is important. But it’s not intrinsic. For example, Locality Sensitive Hashing where specific types of collisions are encouraged.
The existence of weaker hash algos actually moves you further away from your assertion (that semantic vectors are hashes) than closer to it. Weak hashes are about a small finite number of buckets in one dimension. Semantic vectors are an infinite continuum of higher dimensional space locations. These two concepts are therefore exact opposites.
[1] "Steering Language Models With Activation Engineering", 2023, https://arxiv.org/abs/2308.10248
[2] "Multi-Attribute Steering of Language Models via Targeted Intervention", 2025, https://arxiv.org/pdf/2502.12446
> I could tell you exactly how I think we might advance the state of the art in technical writing with embeddings, but where’s the fun in that? You now know why they’re such an interesting and useful new tool in the technical writer toolbox… go connect the rest of the dots yourself!
I read the article because of the title, only to find the above.
[1] It's perhaps not even appropriate to say "buried the lede" because that implies the lede is dug back up at some point, whereas this post buries the lede and then forgets where the lede was buried!!
The article feels incomplete
The evidence for this is the CoT summary with ChatGPT - I have seen something where the LLM uses quotes to grep on the web.
Embeddings seem good in theory but in practice it's probably best to ask an LLM to do a deep search instead by giving it instructions like "use synonyms and common typos and grep".
Does anyone know of a live example of a consumer product using embeddings?
So even if LLM's aren't directly passing a vector to the search engine, my assumption is that the search engine is converting to a vector and searching.
"You interact with embeddings every time you complete a Google Search" from https://cloud.google.com/vertex-ai/generative-ai/docs/embedd...
"Brazil" car manufacture
This forces Brazil to be included in the keywords; at least that's how Google works (or used to?).
https://youtu.be/r6TJfGUhv6s?si=wG6h1kdigiPrNFdk
The video is in English but please pardon my friend's and my Italian accents...
These effects matter in practice. In high-dimensional spaces like word embeddings, even unrelated points can seem equidistant, making basic tasks like clustering or similarity search much harder. So it's not that higher dimensions are mysterious per se, but that they defy the spatial intuitions we've developed from living in three.
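A quick numpy illustration of that concentration effect (the spread between the nearest and farthest neighbor shrinks, relative to the mean distance, as the dimension grows):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.normal(size=(1000, d))             # 1000 random points
    q = rng.normal(size=d)                     # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(d, (dists.max() - dists.min()) / dists.mean())   # relative contrast drops with d
```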
Such a simple tool to implement with so much power in certain situations!
https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...
Beta testing an iOS app for it if anyone is interested: https://recallify.app/
jacobr1•6h ago
PaulHoule•5h ago
As for classification, it is highly practical to put a text through an embedding and then run the embedding through a classical ML algorithm out of
https://scikit-learn.org/stable/supervised_learning.html
This works so consistently that I'm considering not packing a bag-of-words classifier into a text classification library I'm working on. People who hold court on Huggingface forums tend to believe you can do better with fine-tuned BERT, and I'd agree you can do better with that, but training time is 100x and maybe you won't.
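A minimal sketch of that embedding-plus-scikit-learn pattern (the model name and the toy labels are just for illustration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = ["refund my order", "the app crashes on login", "love the new release"]
labels = ["billing", "bug", "praise"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)                            # one dense vector per text
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["charged twice this month"])))
```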
20 years ago you could make bag-of-word vectors and put them through a clustering algorithm
https://scikit-learn.org/stable/modules/clustering.html
and it worked but you got awful results. With embeddings you can use a very simple and fast algorithm like
https://scikit-learn.org/stable/modules/clustering.html#k-me...
and get great clusters.
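The clustering version looks about the same (random stand-in data here in place of a real embedding matrix, one row per document):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(2000, 384)                  # stand-in for 2000 document embeddings
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))                 # documents per cluster
```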
I'd disagree with the bit that it takes "a lot of linear algebra" to find nearby vectors; it can be done with a dot product, so I'd say it is "a little linear algebra".
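Concretely, once the embeddings are unit-normalized it really is just a matrix-vector product (stand-in data below):

```python
import numpy as np

X = np.random.rand(10_000, 384)                 # stand-in for document embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0]                                        # pretend the first document is the query
nearest = np.argsort(-(X @ q))[:10]             # one dot product per document, then sort
```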
podgietaru•5h ago
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
Unfortunately I no longer work at AWS so the infrastructure that was running it is down.
kaycebasques•5h ago
No, it was supposed to be a teaser post followed up by more posts and projects exploring the different applications of embeddings in technical writing (TW). But alas, life happened, and I'm now a proud new papa with a 3-month old baby :D
I do have other projects and embeddings-related posts in the pipeline. Suffice to say, embeddings can help us make progress on all 3 of the "intractable" challenges of TW mentioned here: https://technicalwriting.dev/strategy/challenges.html
jacobr1•4h ago
kaycebasques•1h ago
(It finally published last week after being in review purgatory for months)
sansseriff•5h ago
The big problem I see is attribution and citations. An embedding is just a vector. It doesn't contain any citation back to the source material or modification date or certificate of authenticity. So when using embeddings in RAG, they only serve to link back to a particular page of source material.
Using embeddings as links doesn't dramatically change the way citation and attribution are handled in technical writing. You still end up citing a whole paper or a page of a paper.
I think GraphRAG [1] is a more useful thing to build on for technical literature. There's ways to use graphs to cite a particular concept of a particular page of an academic paper. And for the 'citations' to act as bidirectional links between new and old scientific discourse. But I digress
[1] https://microsoft.github.io/graphrag/