
Ask HN: How would you architect a RAG system for 10M+ documents today?

23•Ftrea•2mo ago
I'm tasked with building a private AI assistant for a corpus of 10 million text documents (living in PostgreSQL). The goal is semantic search and chat, with a requirement for regular incremental updates.

I'm trying to decide between:

Bleeding edge: Implementing something like LightRAG or GraphRAG.

Proven stack: Standard Hybrid Search (Weaviate/Elastic + Reranking) orchestrated by tools like Dify.

For those who have built RAG at this scale:

What is your preferred stack for 2025?

Is the complexity of Graph/LightRAG worth it over standard chunking/retrieval for this volume?

How do you handle maintenance and updates efficiently?

Looking for architectural advice and war stories.
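
For what it's worth, a minimal sketch of what the "proven stack" option above usually looks like in code: dense and keyword retrieval fused with reciprocal rank fusion (RRF), then a reranker. The function names (vector_search, keyword_search, rerank) are placeholders for whichever engines end up being chosen (pgvector/Weaviate/Elastic, a cross-encoder, etc.), not anything prescribed in the question.

  # Minimal hybrid-retrieval skeleton: merge dense and keyword results with
  # reciprocal rank fusion (RRF), then rerank the fused list.
  def rrf_merge(ranked_lists, k=60):
      scores = {}
      for ranked in ranked_lists:
          for rank, doc_id in enumerate(ranked):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
      return sorted(scores, key=scores.get, reverse=True)

  def retrieve(query, vector_search, keyword_search, rerank, top_k=20):
      dense = vector_search(query, limit=100)    # ranked doc ids from the vector index
      sparse = keyword_search(query, limit=100)  # ranked doc ids from BM25/tsvector
      fused = rrf_merge([dense, sparse])
      return rerank(query, fused[:100])[:top_k]  # cross-encoder or LLM reranker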

Comments

parentheses•2mo ago
If it's < 100M documents, with 1024-dimensional vectors, you could fit all of that in ~100G of memory. So maybe storing it in memory is an easy way to go about it. This ignores a lot of "database problems": if the docs are changing constantly, or you have other scalability concerns, you may be better off using a "proper" vector DB. There have been HN postings which indicate that vector DB choice matters. Do your research there.
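
Rough arithmetic behind that estimate, under my own assumptions (one 1024-dim vector per document): at float32 the 10M-doc corpus is about 41 GB, and the ~100G figure for 100M docs lines up with roughly one byte per dimension, i.e. int8-quantized vectors.

  # Back-of-the-envelope RAM estimate for keeping every embedding in memory.
  def embedding_ram_gb(num_docs, dims=1024, bytes_per_value=4):
      return num_docs * dims * bytes_per_value / 1e9

  print(embedding_ram_gb(10_000_000))                      # ~41 GB: 10M docs, fp32
  print(embedding_ram_gb(100_000_000))                     # ~410 GB: 100M docs, fp32
  print(embedding_ram_gb(100_000_000, bytes_per_value=1))  # ~102 GB: 100M docs, int8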
Ftrea•2mo ago
Agreed. Pure in-memory is too risky for us given the persistence requirements and monthly updates. We are definitely going with a 'proper' DB (likely Postgres+pgvector or Weaviate) to handle the state and updates reliably.
walpurginacht•2mo ago
Do you have an evaluation in place that actually necessitates the complex stuff? If not, I'd start simple with proven stuff and collect usage data to determine what's next.
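
One cheap way to get that evaluation in place before committing to anything: a small hand-labeled set of (query, relevant doc ids) pairs and a hit-rate@k score, run against each candidate retriever. The search() callable below is a stand-in with a made-up signature, not a real API.

  # Counts a query as a "hit" if any labeled-relevant doc shows up in the top k.
  def recall_at_k(labeled_queries, search, k=10):
      hits = 0
      for query, relevant_ids in labeled_queries:
          retrieved = {doc_id for doc_id, _ in search(query, k=k)}
          if retrieved & set(relevant_ids):
              hits += 1
      return hits / len(labeled_queries)

  # Run the same labeled set against the plain baseline and any fancier pipeline;
  # only add complexity if the numbers actually move.
  # baseline = recall_at_k(labeled, vector_search)
  # hybrid   = recall_at_k(labeled, hybrid_search)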
Ftrea•2mo ago
This is the sanity check we needed. We don't have a benchmark yet that necessitates complex graph architectures. We will stick to the proven stuff first: a solid Hybrid Search (Vector + Keyword) baseline. We'll collect usage data and only complicate the stack if the baseline fails on specific queries.
journal•2mo ago
Ranked hierarchical pagination and intermediate context control. Also: are these text documents already stored in the database, or raw text worth 10 million documents? If you OCR, why not cache the result? Lucene-style whitespace tokenization is pretty good for a dumb exact (or close-enough) match that gets you a filtered result set which fits the context window better. Imagine having to OCR and run the LLM on the fly, instantly; I would do everything to avoid architecting a system like that. Not sure you're pointing the right end of the stick at the right problem.

Are you intending to max out your allowed context? You can usually extract a rough set before you hit the LLM, so ideally you'd never exceed 50% of the context. How big do you expect responses to be? You have a lot of options; throw whatever is easy to implement at the problem first and see what sticks. Make sure you have terminal access wherever you run this, for maximum flexibility. I obviously prefer ASP.NET with psql.

What kind of data do you need indexed? Say you have something awkward like origin and destination locations: now you need a geo index, maybe a zipcode database, and an intermediate step to find assets within a radius, calculate some distances, and make a decision. Adding geo to any problem is a nightmare, but fun, but only the first time, because now you know how to do it and it takes so long you don't want to again.

If you have terminal access and the source, you have enough room to maneuver updates: it usually ends up being one command that rebuilds the solution and then slides it under the running app, and I've never had problems with that. As for database schema changes, push your production release out until schema changes have slowed to less than 5% of what they were, or something similarly extreme, but be aware that some schema changes are hard to implement even later, and once you're in production everything gets much harder.
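
A sketch of that "filter cheaply before the LLM" idea with plain Postgres full-text search. The table and column names (documents, body, body_tsv) and the character budget are my own placeholders, not anything from the thread.

  import psycopg2

  # Shrink 10M rows to a few hundred candidates with cheap full-text search
  # before any embedding or LLM call ever runs.
  user_query = "quarterly revenue recognition policy"  # example query
  conn = psycopg2.connect("dbname=corpus")
  cur = conn.cursor()
  cur.execute(
      """
      SELECT id, body
      FROM documents
      WHERE body_tsv @@ plainto_tsquery('english', %s)
      ORDER BY ts_rank(body_tsv, plainto_tsquery('english', %s)) DESC
      LIMIT 200
      """,
      (user_query, user_query),
  )
  candidates = cur.fetchall()

  # Stay well under the model's context window (the comment above suggests never
  # exceeding ~50% of it); stop adding documents once the budget is hit.
  MAX_CONTEXT_CHARS = 60_000  # rough budget; tune to the model actually used
  context, used = [], 0
  for doc_id, body in candidates:
      if used + len(body) > MAX_CONTEXT_CHARS:
          break
      context.append(body)
      used += len(body)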
Ftrea•2mo ago
Thanks for the tips. We are strictly doing offline processing (docs are already converted to Markdown stored in DB) to avoid any live OCR latency. Also 100% agreed on filtering—we plan to use metadata/keyword filters (Lucene style) to narrow down the search space before hitting the LLM context window. No intention to verify zipcodes though! :)
osigurdson•2mo ago
Are the documents individually large or fairly small, like a page or two each? If they are small docs, then since you already have Postgres you can just add the pgvector extension, determine which embeddings you want to use, and try it out without committing too much. Maybe add a hash column first so that you can avoid paying to compute the embeddings again if you decide to switch approaches. All of these systems are basically doing the same math to find things, so you aren't going to get magically better results from another one. If the docs are larger, then you have to do chunking anyway.

Would the 10M documents be searched with a single vector search, or would they be pre-filtered by other columns in your table first? If some prefiltering is happening, it naturally makes things faster. You will likely want regular text / tsvector-based search as well, and potentially feed the LLM with those results too, since vector search isn't perfect.

You would then decide whether to do re-ranking before handing the results to the final LLM context window. These days models are pretty good, so they will do their own re-ranking to some extent, but it depends a bit on the cost, latency and quality of result that you are looking for.
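
A rough sketch of the hash-column idea: only re-embed rows whose content has changed since the last run. Column names (content_md, content_hash, embedding) and embed() are assumptions for illustration, not the commenter's schema.

  import hashlib
  import psycopg2

  def refresh_embeddings(conn, embed, batch=1000):
      cur = conn.cursor(name="doc_scan")  # server-side cursor: don't pull 10M rows at once
      cur.execute("SELECT id, content_md, content_hash FROM documents")
      write = conn.cursor()
      while True:
          rows = cur.fetchmany(batch)
          if not rows:
              break
          for doc_id, text, stored_hash in rows:
              h = hashlib.sha256(text.encode("utf-8")).hexdigest()
              if h == stored_hash:
                  continue  # content unchanged since last run; skip the embedding cost
              vec = embed(text)  # any embedding model; returns a list of floats
              literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector text format
              write.execute(
                  "UPDATE documents SET embedding = %s::vector, content_hash = %s WHERE id = %s",
                  (literal, h, doc_id),
              )
      conn.commit()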

Ftrea•2mo ago
This is extremely helpful. Our docs are indeed small (1-2 pages mostly), so distinct chunking might not even be needed—maybe one vector per doc or page. Since we are already on Postgres, pgvector + tsvector (for hybrid search) seems like the most logical MVP. Question: In your experience, does pgvector with HNSW indexes handle the 10M row scale with low latency (<200ms) for real-time chat? Or does a dedicated DB like Weaviate still offer a significant edge there?
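
For reference, the pgvector HNSW setup that question is about looks roughly like this. The index parameters shown are pgvector's documented defaults, written out explicitly; they are illustrative, not a tuned recommendation for 10M rows.

  import psycopg2

  conn = psycopg2.connect("dbname=corpus")
  cur = conn.cursor()
  cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
  cur.execute(
      "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
      "ON documents USING hnsw (embedding vector_cosine_ops) "
      "WITH (m = 16, ef_construction = 64)"
  )
  conn.commit()

  # Per-session recall/latency knob: higher ef_search = better recall, slower queries.
  cur.execute("SET hnsw.ef_search = 100")
  cur.execute(
      "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 20",
      ("[0.1,0.2,0.3]",),  # placeholder; a real query embedding would be 1024-dim
  )
  top_ids = [row[0] for row in cur.fetchall()]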
mikert89•2mo ago
chunk the documents, use contextual embeddings, put into the vectordb in postgres
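
To spell out what that pipeline could look like: prepend document-level context to each chunk before embedding so chunks stay meaningful in isolation. The chunk sizes and the summarize()/embed() callables below are placeholders, not anything the commenter specified.

  # "Contextual embeddings" sketch: one vector per contextualized chunk.
  def chunk(text, size=1500, overlap=200):
      step = size - overlap
      return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

  def contextual_embeddings(doc_title, doc_text, summarize, embed):
      doc_context = summarize(doc_text)  # e.g. an LLM-written one-paragraph summary
      vectors = []
      for piece in chunk(doc_text):
          contextualized = f"Document: {doc_title}\nContext: {doc_context}\n\n{piece}"
          vectors.append(embed(contextualized))
      return vectors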