Roughly, first there is the query analysis/manipulation phase, where you might have NER, spell check, query expansion/relaxation, etc.
Then there is the selection phase, where you retrieve all items that are relevant. Sometimes people will bring in results from both text- and vector-based indices, perhaps with an additional layer to group results.
Then finally you have the reranking layer, using a cross-encoder model, which might even have some personalisation in the mix.
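A minimal sketch of those three phases, with toy stand-ins for every component (Doc, the stub search functions, and the scoring are all illustrative, not any particular library):

    from dataclasses import dataclass

    @dataclass
    class Doc:
        id: int
        text: str

    def analyze_query(q: str) -> str:
        # Phase 1: spell check, NER, and expansion/relaxation would go here.
        return q.lower().strip()

    def text_search(q: str, docs: list[Doc]) -> list[Doc]:
        # Phase 2a: lexical retrieval (BM25/FTS in a real system).
        return [d for d in docs if any(t in d.text.lower() for t in q.split())]

    def vector_search(q: str, docs: list[Doc]) -> list[Doc]:
        # Phase 2b: ANN lookup over embeddings in a real system.
        return []  # placeholder

    def rerank(q: str, candidates: list[Doc]) -> list[Doc]:
        # Phase 3: a cross-encoder would score (query, doc) pairs here,
        # with personalisation signals blended into the score.
        return sorted(candidates, key=lambda d: d.text.lower().count(q), reverse=True)

    def search(q: str, docs: list[Doc]) -> list[Doc]:
        q = analyze_query(q)
        seen: set[int] = set()
        merged = []
        # Merge and deduplicate candidates from both index types.
        for d in text_search(q, docs) + vector_search(q, docs):
            if d.id not in seen:
                seen.add(d.id)
                merged.append(d)
        return rerank(q, merged)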
Also, with vector search you might not necessarily need query expansion, since semantic similarity already captures loose associations. But every domain is unique, and there's only one way to find out.
We open-sourced our impl just this week: https://github.com/with-logic/intent
We use Groq with gpt-oss-20b, which gives great results and only adds ~250ms to the processing pipeline.
If you use the mini/flash models from OpenAI/Gemini, expect 2.5s-3s of overhead.
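For reference, here is a rough sketch of timing that kind of query-understanding call via Groq's OpenAI-compatible endpoint (the model id, prompt, and JSON output shape are assumptions on my part; check the repo above for the actual implementation):

    import os, time
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key=os.environ["GROQ_API_KEY"],
    )

    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="openai/gpt-oss-20b",  # assumed Groq model id
        messages=[
            {"role": "system",
             "content": "Extract the search intent and key entities from the query. Reply as JSON."},
            {"role": "user", "content": "cheap flights to tokyo in march"},
        ],
    )
    print(resp.choices[0].message.content)
    print(f"overhead: {(time.perf_counter() - start) * 1000:.0f}ms")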
- LLMs with a grep or full-text search tool turn out to be great at fuzzy search already - they throw a bunch of OR conditions together and run further searches if they don't find what they want (see the sketch after this list)
- ChatGPT web search and Claude Code code search are my favorite AI-assisted search tools and neither bother with vectors
- Building and maintaining a large vector search index is a pain. The vectors are usually pretty big, and you need to keep them in memory to get truly great performance. FTS and grep are way less hassle.
- Vector matches are weird. You get back the top twenty results, but those might be super relevant or they might be total garbage; it's on you to do a second pass to figure out whether they're actually useful.
I expected to spend much of 2025 building vector search engines, but ended up not finding them as valuable as I had thought.
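Here's a sketch of the OR-expansion pattern from the first point, using SQLite's FTS5 (table and terms are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
    conn.executemany(
        "INSERT INTO docs(body) VALUES (?)",
        [("k8s pod scheduling",),
         ("container orchestration basics",),
         ("gardening tips",)],
    )

    # One fuzzy intent, several OR'd lexical variants:
    query = 'kubernetes OR k8s OR "container orchestration"'
    for (body,) in conn.execute("SELECT body FROM docs WHERE docs MATCH ?", (query,)):
        print(body)
    # If nothing comes back, an agent just widens the terms and retries.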
On the other hand, generating and regenerating embeddings for all your documents can be time-consuming and costly, depending on how often you need to reindex.
If you find disk I/O for grep acceptable, why would it matter for vectors? They aren’t much bigger, are they?
Embeddings are huge compared to what you need with FTS, which generally has good locality, compresses extremely well, and permits sub-linear intersection algorithms and other tricks to make the most of your IOPS.
Regardless of vector size, you are unlikely to get more than one embedding per I/O operation with a vector approach. Even if you can fit more vectors into a block, there is no good way of arranging them to ensure efficient locality like you can with e.g. a postings list.
Thus, off a 500K IOPS drive with a 100ms execution window, your theoretical upper bound is 50K embeddings ranked, assuming the ranking itself takes no time, no other disk operations are performed, and you have only a single user.
Given that you are more than likely comparing multiple embeddings per document, this carriage turns into a pumpkin pretty rapidly.
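Spelling out that arithmetic (same numbers as above; the per-document embedding count is an assumption for illustration):

    iops = 500_000          # drive's random-read IOPS
    window_s = 0.100        # 100ms latency budget
    embeddings_per_doc = 4  # assumed: several chunk embeddings per document

    reads = int(iops * window_s)        # 50,000 embeddings touched, best case
    docs = reads // embeddings_per_doc  # 12,500 documents actually ranked
    print(reads, docs)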
Information about how Bing text search works appears to be pretty sparse though.
One of the great mysteries to me right now is how ChatGPT search actually works.
It was Bing when they first launched it, but OpenAI have been investing a ton into their own search infrastructure since then. I can't figure out how much of it is Bing these days vs their own home-rolled system.
What's confusing is how secretive OpenAI are about it! I would personally value it a whole lot more if I understood how it works.
So maybe it's way more vector-based than I believe.
I'd expect any modern search engine to have aspects of vectors somewhere - some kind of hybrid BM25 + vectors thing, or using vectors for re-ranking after retrieving likely matches via FTS. That's different from being pure vectors though.
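One common shape for that hybrid is to retrieve two ranked lists and fuse them, e.g. with reciprocal rank fusion (a sketch; the doc ids are made up, and k=60 is the constant from the original RRF paper):

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Each list is already ranked best-first; score by reciprocal rank.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits = ["d3", "d1", "d7"]    # from FTS / BM25
    vector_hits = ["d1", "d9", "d3"]  # from ANN over embeddings
    print(rrf([bm25_hits, vector_hits]))  # ['d1', 'd3', 'd9', 'd7']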
We need to think about query+content understanding before deciding that a sub-problem happens to be helped by embeddings. RAG naively looks like a question-answering “passage retrieval” problem, when in reality it’s more structured retrieval than we first assume (and LLMs can learn how to use more structured approaches to explore data much better now than in 2022).
https://softwaredoug.com/blog/2025/12/09/rag-users-want-affo...
import thing from everything
thing()

Really dislike this type of content...