
Embeddings Are Underrated

https://technicalwriting.dev/ml/embeddings/overview.html
133•jxmorris12•1h ago

Comments

jacobr1•1h ago
I may have missed it ... but were any direct applications to tech writers discussed in this article? Embeddings are fascinating and very important for things like LLMs or semantic search, but the author seems to imply more direct utility.
PaulHoule•58m ago
Semantic search, classification, and clustering. For the first, there is a substantial breakthrough in IR every 10 years or so, so you take what you can get. (I got so depressed reading TREC proceedings, which seemed to prove that "every obvious idea to improve search relevance doesn't work"; it wasn't until I found a summary of the first ten years that I learned they had turned up one useful result, BM25.)

As for classification, it is highly practical to put a text through an embedding model and then run the embedding through a classical ML algorithm out of

https://scikit-learn.org/stable/supervised_learning.html

This works so consistently that I'm considering not packing a bag-of-words classifier into a text classification library I'm working on. People who hold court on Hugging Face forums tend to believe you can do better with a fine-tuned BERT, and I'd agree you can do better with that, but training time is 100x and maybe you won't.
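A minimal sketch of that pipeline (the sentence-transformers model name is an assumption; any embedding model that returns vectors will do):

    # Hedged sketch: embed texts with a pretrained model, then hand the
    # vectors to an ordinary scikit-learn classifier.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    texts = ["great battery life", "screen cracked after a week", "fast shipping"]
    labels = [1, 0, 1]  # toy labels: 1 = positive, 0 = negative

    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model choice
    X = model.encode(texts)                           # shape: (n_texts, embedding_dim)

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(model.encode(["battery died immediately"])))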

20 years ago you could make bag-of-words vectors and put them through a clustering algorithm

https://scikit-learn.org/stable/modules/clustering.html

and it worked but you got awful results. With embeddings you can use a very simple and fast algorithm like

https://scikit-learn.org/stable/modules/clustering.html#k-me...

and get great clusters.
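For example, a hedged sketch of k-means over embeddings (model name assumed, as above):

    # k-means over embedding vectors instead of bag-of-words counts.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = ["how to reset my password", "forgot login credentials",
            "refund for a duplicate charge", "billing error on my invoice"]

    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
    print(list(zip(docs, clusters)))  # account-access docs vs. billing docs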

I'd disagree with the bit that it takes "a lot of linear algebra" to find nearby vectors; it can be done with a dot product, so I'd say it is "a little linear algebra".
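To make the "a little linear algebra" point concrete, a sketch with nothing but numpy (the data and dimensions are made up):

    # With unit-normalized embeddings, "nearby" is just a dot product.
    import numpy as np

    corpus = np.random.randn(10_000, 384)                  # pretend embeddings
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

    query = corpus[42]                                      # any unit vector
    scores = corpus @ query                                 # cosine similarity
    print(np.argsort(-scores)[:5])                          # 5 nearest neighbors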

podgietaru•54m ago
I built an RSS aggregator with semantic search using embeddings. The main use was being able to categorise articles against any freshly created category, so you could have arbitrary categories.

https://github.com/aws-samples/rss-aggregator-using-cohere-e...

Unfortunately I no longer work at AWS so the infrastructure that was running it is down.
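The gist of the arbitrary-category trick, as a hedged sketch (the original used Cohere embeddings on AWS; the model and the example strings below are stand-ins):

    # Categories are just text, so embed them too and assign each article to
    # whichever category embedding is closest.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model
    categories = ["space exploration", "personal finance", "open source software"]
    article = "NASA delays the Artemis II lunar flyby"  # toy headline

    cat_emb = model.encode(categories, convert_to_tensor=True)
    art_emb = model.encode(article, convert_to_tensor=True)
    scores = util.cos_sim(art_emb, cat_emb)[0]
    print(categories[int(scores.argmax())])           # -> "space exploration"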

kaycebasques•37m ago
> were any direct applications to tech writers discussed in this article

No, it was supposed to be a teaser post, followed up by more posts and projects exploring the different applications of embeddings in technical writing (TW). But alas, life happened, and I'm now a proud new papa with a 3-month-old baby :D

I do have other projects and embeddings-related posts in the pipeline. Suffice it to say, embeddings can help us make progress on all 3 of the "intractable" challenges of TW mentioned here: https://technicalwriting.dev/strategy/challenges.html

sansseriff•30m ago
It would be great to semantically search through the literature with embeddings. At least one person I know of is trying to generate a vector database of all arXiv papers.

The big problem I see is attribution and citations. An embedding is just a vector. It doesn't contain any citation back to the source material, or a modification date, or a certificate of authenticity. So when using embeddings in RAG, they only serve to link back to a particular page of source material.

Using embeddings as links doesn't dramatically change the way citation and attribution are handled in technical writing. You still end up citing a whole paper or a page of a paper.

I think GraphRAG [1] is a more useful thing to build on for technical literature. There are ways to use graphs to cite a particular concept on a particular page of an academic paper, and for the 'citations' to act as bidirectional links between new and old scientific discourse. But I digress.

[1] https://microsoft.github.io/graphrag/

lblume•1h ago
Semantic search seems like a more promising use case than simple related articles. A big problem with classical keyword-based search is that synonyms are not reflected at all. With semantic search you can search for what you mean, not what words you expect to find on the site you are looking for.
kgeist•59m ago
In my benchmarks for a service which is now running in production, hybrid search based on both keywords and embeddings performed the best. Sometimes you need exact keyword matches; other times, synonyms are more useful. Hybrid search combines both sets of results into a single, unified set. OpenSearch has built-in support for this approach.
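One common way to fuse the two result lists (a hedged sketch of the general idea, not the OpenSearch internals) is reciprocal rank fusion:

    # Reciprocal rank fusion (RRF): merge a keyword ranking and an embedding
    # ranking into one ranking without having to normalize their scores.
    def rrf(keyword_hits, vector_hits, k=60):
        scores = {}
        for hits in (keyword_hits, vector_hits):
            for rank, doc_id in enumerate(hits):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    print(rrf(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # d1 and d3 rise to the top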
PaulHoule•45m ago
A case related to that is "more like this" which in my mind breaks down into two forks:

(1) Sometimes your query is a short document. Say you wanted to know if there were any patents similar to something you invented. You'd give a professional patent searcher a paragraph or a few paragraphs describing the invention; you can give a "semantic search engine" that same paragraph -- I helped build one that did about as well as the professional, using embeddings, before this was cool.

(2) Even Salton's early work on IR talked about "relevance feedback", where you'd mark some documents in your results as relevant and some as irrelevant. With bag-of-words this doesn't really work well (it can take 1000 samples for a bag-of-words classifier to "wake up"), but it works much better with embeddings.
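A hedged sketch of what relevance feedback over embeddings can look like (the simplest version is plain vector arithmetic, no classifier needed):

    # Average the vectors the user marked relevant/irrelevant and re-rank the
    # corpus by similarity to the difference.
    import numpy as np

    def rerank(doc_vecs, relevant_idx, irrelevant_idx):
        direction = doc_vecs[relevant_idx].mean(axis=0) - doc_vecs[irrelevant_idx].mean(axis=0)
        return np.argsort(-(doc_vecs @ direction))      # best matches first

    docs = np.random.randn(100, 384)                    # pretend document embeddings
    print(rerank(docs, relevant_idx=[3, 17], irrelevant_idx=[8])[:10])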

The thing is that embeddings are "hunchy" and not really the right data structure to represent things like "people who are between 5 feet and 6 feet tall and have been on more than 1000 airplane flights in their life" (knowledge graph/database sorts of queries) or "the thread that links the work of Derrida and Badiou" (could be spelled out logically in some particular framework but doing that in general seems practically intractable)

jbellis•44m ago
they're both useful

search is an active "I'm looking for X"

related articles is a passive "hey thanks for reading this article, you might also like Y"

ncruces•1h ago
Previous discussion: https://news.ycombinator.com/item?id=42013762
kaycebasques•1h ago
Hello, I wrote this. Thank you for reading!

The post was previously discussed 6 months ago: https://news.ycombinator.com/item?id=42013762

To be clear, when I said "embeddings are underrated" I was only arguing that my fellow technical writers (TWs) were not paying enough attention to a very useful new tool in the TW toolbox. I know that the statement sounds silly to ML practitioners, who very much don't "underrate" embeddings.

I know that the post is light on details regarding how exactly we apply embeddings in TW. I have some projects and other blog posts in the pipeline. Short story long, embeddings are important because they can help us make progress on the 3 intractable challenges of TW: https://technicalwriting.dev/strategy/challenges.html

rybosome•33m ago
Thanks for the write-up!

I’m curious how you found the quality of the results? This gets into evals which ML folks love, but even just with “vibes” do the results eyeball as reasonable to you?

petesergeant•1h ago
I wrote an embeddings explainer a few days ago if anyone is interested: https://sgnt.ai/p/embeddings-explainer/

Very little maths and lots of dogs involved.

podgietaru•56m ago
I wrote a blog post about embeddings - and a sample application to show their uses.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

https://github.com/aws-samples/rss-aggregator-using-cohere-e...

I really enjoy working with embeddings. They're truly fascinating as a representation of meaning - and also a very cheap, effective way to do things like categorisation and clustering.

btbuildem•44m ago
How would you approach using them in a specialized discipline (think technical jargon, acronyms, etc.) where training a model from scratch is practically impossible because everyone (customers, solution providers) fiercely guards their data?

A generic embedding model does not have enough specificity to cluster the specialized terms or "code names" of specific entities (these differ across orgs but represent the same sets of concepts within the domain). A more specific model cannot be trained because the data is not available.

Quite the conundrum!

tyho•54m ago
> The 2D map analogy was a nice stepping stone for building intuition but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It’s impossible for us lowly 3-dimensional creatures to visualize what “distance” looks like in 1000 dimensions. Also, we don’t know what each dimension represents, hence the section heading “Very weird multi-dimensional space”.5 One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain a dimension with some notion of gender. And so on. Well Dude, we just don’t know.

nit. This suggests that the model contains a direction with some notion of gender, not a dimension. Direction and dimension appear to be inextricably linked by definition, but with some handwavy maths, you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n. This helps explain why spaces on the order of 1k dimensions can "fit" billions of concepts.
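A quick numerical illustration of that handwave (random unit vectors, which is roughly what "nearly orthogonal" looks like in high dimensions):

    # Random unit vectors in ~1000 dimensions are all nearly orthogonal to one
    # another, which is why such a space can host far more than 1000 "concepts".
    import numpy as np

    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((5000, 1000))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    cos = vecs @ vecs[0]                 # similarity of one vector to all the others
    print(float(np.abs(cos[1:]).max()))  # typically well under 0.2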

kaycebasques•44m ago
Oh yes, this makes a lot of sense, thank you for the "nit" (which doesn't feel like a nit to me, it feels like an important conceptual correction). When I was writing the post I definitely paused at that part, knowing that something was off about describing the model as having a dimension that maps to gender. As you said, since the models are general-purpose and work so well in so many domains, there's no way that there's a 1-to-1 correspondence between concepts and dimensions.

I think your comment is also clicking for me now because I previously did not really understand how cosine similarity worked, but then watched videos like this and understand it better now: https://youtu.be/e9U0QAFbfLI

I will eventually update the post to correct this inaccuracy, thank you for improving my own wetware's conceptual model of embeddings

OJFord•27m ago
I would think of it as the whole embedding concept again on a finer grained scale: you wouldn't say the model 'has a dimension of whether the input is king', instead the embedding expresses the idea of 'king' with fewer dimensions than would be needed to cover all ideas/words/tokens like that.

So the distinction between a direction and a dimension expressing 'gender' is that maybe gender isn't 'important' (or I guess high-information-density) enough to be an entire dimension, but rather is expressed by a linear combination of two (or more) yet more abstract dimensions.

aaronblohowiak•31m ago
>nearly orthogonal dimensions within n dimensional space

nit within a nit: I believe you intended to write "nearly orthogonal directions within n dimensional space" which is important as you are distinguishing direction from dimension in your post.

PaulHoule•30m ago
Note you don't see arXiv papers where somebody feeds 1000 male-gendered words into a word embedding and gets 950 correct female-gendered words back. Statistically it does better than chance, but word embeddings don't do very well.
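The kind of experiment being described, as a hedged sketch (gensim, the small GloVe download, and the word pairs are all assumptions for illustration):

    # Score word-analogy arithmetic (king - man + woman ≈ queen) over many
    # pairs using pretrained word vectors.
    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors
    pairs = [("king", "queen"), ("actor", "actress"), ("waiter", "waitress")]

    hits = 0
    for male, female in pairs:
        guess = wv.most_similar(positive=[male, "woman"], negative=["man"], topn=1)[0][0]
        hits += (guess == female)
    print(f"{hits}/{len(pairs)} analogies recovered")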

In

https://nlp.stanford.edu/projects/glove/

there are a number of graphs where they have about N=20 points that seem to fall in "the right place", but there are a lot of dimensions involved, and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to. If you try experiments with N>100 words you go endlessly in circles and produce the kind of inconclusively negative results that people don't publish.

The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word. For instance you can't really build a "part of speech" classifier that can tell you "red" is an adjective because it is also a noun, but give it the context and you can.

In the context of full-text search, bringing in synonyms is a mixed bag because a word might have 2 or 3 meanings, and the irrelevant synonyms are... irrelevant and will bring in irrelevant documents. Modern embeddings that recognize context not only bring in synonyms but will suppress usages of the word with different meanings, something the IR community has tried to figure out for about 50 years.

osigurdson•26m ago
You can't visualize it, but you can certainly compute the Euclidean distance. Tools like UMAP can be used to drop the dimensionality as well.
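For instance, a hedged sketch with umap-learn (the data is made up):

    # Project high-dimensional embeddings down to 2D so they can be plotted
    # and eyeballed.
    import numpy as np
    import umap   # pip install umap-learn

    embeddings = np.random.randn(500, 384)   # pretend embeddings
    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    print(coords.shape)                      # (500, 2)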
aswanson•10m ago
Any good umap links?
daxfohl•6m ago
Wait, but if gender was composed of say two dimensions, then there'd be no way to distinguish between "the gender is different" and "the components represented by each of those dimensions are individually different", right?
jbellis•46m ago
Great to see embeddings getting some love outside the straight-up-ML space!

I had a non-traditional use case recently, as well. I wanted to debounce the API calls I'm making to gemini flash as the user types his instructions, and I decided to try a very lightweight embeddings model, light enough to run on CPU and way too underpowered to attempt vector search with. It works pretty well! https://brokk.ai/blog/brokk-under-the-hood
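A hedged sketch of that debounce idea (the model name and threshold are assumptions, not the Brokk implementation): only re-call the big model when the instruction's embedding has drifted meaningfully since the last call.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small, CPU-friendly stand-in
    last_sent = None

    def maybe_call_llm(text, threshold=0.97):
        global last_sent
        emb = model.encode(text, convert_to_tensor=True)
        if last_sent is not None and util.cos_sim(emb, last_sent) > threshold:
            return  # too similar to what we already sent; skip the API call
        last_sent = emb
        print(f"calling the LLM with: {text!r}")      # stand-in for the real call

    maybe_call_llm("rename the foo variable")
    maybe_call_llm("rename the foo variable please")  # likely skipped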

stefanka•36m ago
I like that this looks like a very ethical and "fair" use of the LLM technology
minimaxir•16m ago
> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text.

An embedding is generated after a single pass through the model, so functionally it's the equivalent of generating a single token from a text generation model.

jasonjmcghee•9m ago
Another very cool attribute of embeddings and embedding search is that they are resource cheap enough that you can perform them client side.

ONNX models can be loaded and executed with transformers.js https://github.com/huggingface/transformers.js/

You can even build and statically host indices like hnsw for embeddings.

I put together a little open source demo for this here https://jasonjmcghee.github.io/portable-hnsw/ (it's a prototype / hacked together approximation of hnsw, but you could implement the real thing)

Long story short, represent indices as queryable parquet files and use duckdb to query them.

Depending on how you host, it's either free or nearly free. I used GitHub Pages, so it's free. R2 with Cloudflare would only cost the size of what you store (very cheap - no egress fees).

charcircuit•4m ago
How are they underrated when they have been used by the top sites for over a decade? The author doesn't really explain why he thinks they are underrated despite being behind almost every search and recommendation users receive on their computers.
daxfohl•2m ago
I wonder if this could be used to find redundant code
