
Production RAG: what I learned from processing 5M+ documents

https://blog.abdellatif.io/production-rag-processing-5m-documents
146•tifa2up•2h ago

Comments

manishsharan•2h ago
Thanks for sharing. TIL about rerankers.

Chunking strategy is a big issue. I found acceptable results by shoving large texts to Gemini Flash and having it summarize and extract chunks instead of whatever text splitter I tried. I use the method published by Anthropic https://www.anthropic.com/engineering/contextual-retrieval i.e. include a full summary along with the chunks for each embedding.
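A minimal sketch of the contextual-retrieval idea linked above (not the commenter's actual code): before embedding, each chunk gets a short LLM-generated blurb situating it within the whole document prepended to it. The prompt wording and the `call_llm` placeholder are assumptions; swap in whatever API you use.

```python
# Sketch of Anthropic-style contextual retrieval: for each chunk, ask an
# LLM for a short context situating the chunk in the full document, and
# prepend that context to the chunk text before embedding it.
# `call_llm` is a placeholder for your actual LLM API call.

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def contextualize(document: str, chunk: str, call_llm) -> str:
    """Return the chunk prefixed with LLM-generated situating context."""
    context = call_llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n\n{chunk}"

# Example with a stub LLM:
stub = lambda prompt: "From the Q3 report's revenue section."
print(contextualize("...full report text...", "Revenue grew 3%.", stub))
```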

I also created a tool to enable the LLM to do vector search on its own.

I do not use Langchain or Python; I use Clojure + the LLMs' REST APIs.

esafak•1h ago
Have you measured your latency, and how sensitive are you to it?
manishsharan•1h ago
>> Have you measured your latency, and how sensitive are you to it?

Not sensitive to latency at all. My users would rather have well researched answers than poor answers.

Also, I use batch-mode APIs for chunking; it is so much cheaper.

jascha_eng•1h ago
I have a RAG setup that doesn't work on documents but other data points that we use for generation (the original data is call recordings but it is heavily processed to just a few text chunks). Instead of a reranker model we do vector search and then simply ask GPT-5 in an extra call which of the results is the most relevant to the input question. Is there an advantage to actual reranker models rather than using a generic LLM?
tifa2up•1h ago
OP here. Rerankers are small finetuned models; they're cheap and very fast compared to an additional GPT-5 call.
jascha_eng•1h ago
It's an async process in my case (custom deep research like) so speed is not that critical
esafak•1h ago
They say the chunker is the most important part, but theirs looks rudimentary: https://github.com/agentset-ai/agentset/blob/main/packages/e...

That is, there is nothing here that one could not easily write without a library.

tifa2up•1h ago
OP here. We've been working on agentset.ai full-time for 2 months. The product now gets you something working quite well out of the box. Better than what most people with no experience in RAG would build (I'd say so with confidence).

Ingestion + Agentic Search are two areas that we're focused on in the short term.

teraflop•1h ago
I'm not sure there is a chunker in this repo. The file you linked certainly doesn't seem to perform any chunking, it just defines a data model for chunks.

The only place I see that actually operates on chunks does so by fetching them from Redis, and AFAICT nothing in the repo actually writes to Redis, so I assume the chunker is elsewhere.

https://github.com/agentset-ai/agentset/blob/main/packages/j...

alexchantavy•1h ago
> What moved the needle: Query Generation

What does query generation mean in this context? It's probably not SQL queries, right?

daemonologist•1h ago
It's described in the remainder of the point - they use an LLM to generate additional search queries, either rephrasings of the user's query or bringing additional context from the chat history.
goleary•1h ago
Here's an interesting read on the evolution beyond RAG: https://www.nicolasbustamante.com/p/the-rag-obituary-killed-...

One of the key features in Claude Code is "Agentic Search" aka using (rip)grep/ls to search a codebase without any of the overhead of RAG.

Sounds like even RAG approaches use a similar approach (Query Generation).

andreasgl•1h ago
I think they mean query expansion: https://en.wikipedia.org/wiki/Query_expansion
nextworddev•1h ago
Exactly what kind of processing was done? Your pipeline is a function of the use case, lest you overengineer…
js98•1h ago
Similar writeup I did about 1.5 years ago for processing millions of (technical) pages for RAG. Lots has stayed the same it seems

https://jakobs.dev/learnings-ingesting-millions-pages-rag-az...

winstonp•25m ago
I also built a RAG system about a year back for technical search, everything seems the same!
daemonologist•1h ago
I concur:

The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.

Your metadata/tabular data often contains basic facts that a human takes for granted, but which aren't repeated in every text chunk - injecting it can help a lot in making the end model seem less clueless.

The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.
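The metadata point above can be sketched in a few lines (field names here are purely illustrative): prepend the document's basic facts to each chunk so the end model sees them even when the chunk text doesn't repeat them.

```python
# Sketch of metadata injection: prefix each chunk with a compact header
# of document-level facts (title, author, date, ...) so every retrieved
# chunk carries the context a human would take for granted.

def inject_metadata(chunk: str, meta: dict[str, str]) -> str:
    """Return the chunk with a one-line metadata header prepended."""
    header = " | ".join(f"{k}: {v}" for k, v in meta.items())
    return f"[{header}]\n{chunk}"

print(inject_metadata(
    "Revenue grew 3% quarter over quarter.",
    {"title": "Q3 Report", "company": "Acme", "date": "2024-10-01"},
))
```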

thethimble•49m ago
I wish there was more info on the article about actual customer usage - particularly whether it improved process efficiency. It's great to focus on the technical aspects of system optimization but unless this translates to tangible business value it's all just hype.
leetharris•1h ago
Embedding based RAG will always just be OK at best. It is useful for little parts of a chain or tech demos, but in real life use it will always falter.
sgt•1h ago
What do you recommend? Query generation?
esafak•1h ago
Compared with what?
charcircuit•1h ago
Most of my ChatGPT queries use RAG (based on the query, ChatGPT will decide if it needs to search the web) to get up-to-date information about the world. In real life it's effective, and it's why every large provider supports it.
underlines•58m ago
RAG will be pronounced dead again and again. It has its use cases. We moved to agentic search with RAG as one tool, while other retrieval strategies we added use real-time search in the sources, often skipping ingested and chunked sources entirely. Large context windows allow for putting almost whole documents into one request.
phillipcarter•48m ago
Not necessarily? It's been the basis of one of the major ways people would query their data since 2023 on a product I worked on: https://www.honeycomb.io/blog/introducing-query-assistant

The difference is this feature explicitly isn't designed to do a whole lot, which is still the best way to build most LLM-based products and sandwich it between non-LLM stuff.

mediaman•1h ago
The point about synthetic query generation is good. We found users had very poor queries, so we initially had the LLM generate synthetic queries. But then we found that the results could vary widely based on the specific synthetic query it generated, so we had it create three variants (all in one LLM call, so that you can prompt it to generate a wide variety, instead of getting three very similar ones back), do parallel search, and then use reciprocal rank fusion to combine the list into a set of broadly strong performers. For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

This, combined with a subsequent reranker, basically eliminated any of our issues on search.
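The reciprocal rank fusion step described above is small enough to show in full (a generic sketch, not this commenter's code): each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 as the conventional constant.

```python
# Reciprocal rank fusion: combine several ranked result lists (e.g. one
# per synthetic query variant) into a single list of broadly strong
# performers. Documents appearing near the top of many lists win.

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; k=60 is the standard RRF constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well in every list beats one that tops only a single list:
fused = rrf([["a", "b", "c"], ["b", "c", "d"], ["b", "a", "d"]])
print(fused[0])  # -> "b"
```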

avereveard•28m ago
final tip is to also feed the interpretation of the user's search back to the user, so they can check whether the LLM's understanding was correct.
deepsquirrelnet•27m ago
> For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

One thing I’m always curious about is if you could simplify this and get good/better results using SPLADE. The v3 models look really good and seem to provide a good balance of semantic and lexical retrieval.

siva7•18m ago
Boy, that should not be the concern of the end user (developer) but of those implementing RAG solutions as a service at Amazon, Microsoft, OpenAI and so on.
n_u•1h ago
> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?

tifa2up•1h ago
OP here. A reranker is a specialized model that takes the user query and a list of candidate results, then re-orders them based on which ones are more relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
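To make the "50 chunks in, 15 out" shape from the article concrete, here is a toy stand-in (not the linked Cohere code): the term-overlap scorer below is purely illustrative; a real deployment would score each query/chunk pair with a reranker model instead.

```python
# Toy reranking step: take ~50 candidate chunks from vector search,
# score each against the query, and keep the top 15. The term-overlap
# score here is a placeholder for a real reranker model's relevance score.

def rerank(query: str, chunks: list[str], top_n: int = 15) -> list[str]:
    """Return the top_n chunks, best-scoring first."""
    q_terms = set(query.lower().split())

    def score(chunk: str) -> float:
        # Placeholder relevance: fraction of query terms present in chunk.
        return len(q_terms & set(chunk.lower().split())) / (len(q_terms) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_n]

candidates = [f"chunk {i} about databases" for i in range(49)] + \
             ["vector search with rerankers"]
top = rerank("how do rerankers improve vector search", candidates)
print(top[0])  # the chunk sharing query terms ranks first
```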

yahoozoo•9m ago
What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?
383toast•59m ago
They should've tested other embedding models, there are better ones than openai's (and cheaper)
prettyblocks•54m ago
Which do you suggest?
roze_sha•41m ago
https://huggingface.co/spaces/mteb/leaderboard
383toast•32m ago
yep
hatmanstack•38m ago
Not here to schlep for AWS but S3 Vectors is hands down the SOTA here. That combined with a Bedrock Knowledge Base to handle Discovery/Rebalance tasks makes for the simplest implementation on the Market.

Once Bedrock KB backed by S3 Vectors is released from Beta it'll eat everybody's lunch.

arcanemachiner•25m ago
Shill, not schlep.

I'm correcting you less out of pedantry, and more because I find the correct term to be funny.

hatmanstack•16m ago
I feel like I'm schelpin' through these comments, it's all mishigas
esafak•7m ago
You feel like a schlemiel, perhaps?
pietz•22m ago
I find it interesting that so many services and tools were investigated except for embedding models. I would have thought that's one of the biggest levers.
Trias11•20m ago
they just grabbed the better one (3-large) right off the bat. 6x the cost of 3-small, but it's still tiny.
