We hit this building agentic workflows and RAG backends. What we needed wasn’t “search”; it was a way to retrieve real, structured full text with enough metadata to plug straight into a reasoning system. So we built a system that could do that: multimodal inputs (text, math, figures), clean citations, reference chaining, and filters that work (by date, by source, etc.).
The hard part wasn’t retrieval but preprocessing at scale: figuring out how to analyse, chunk, and structure tens of millions of docs without taking months or breaking the bank. Not to mention dealing with licensed content where formats vary wildly, or building retrieval systems at this scale.
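For a sense of what "chunk with metadata, then filter" means in practice, here is a minimal sketch. It is not the actual pipeline described above: the names `Chunk`, `chunk_document`, and `filter_chunks` are hypothetical, and fixed-size overlapping word windows stand in for whatever structure-aware splitting a real system would use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """One retrievable unit; the field names are illustrative assumptions."""
    doc_id: str
    source: str
    published: date
    index: int  # position of the chunk within its document
    text: str


def chunk_document(doc_id, source, published, text, size=200, overlap=40):
    """Split text into overlapping word windows, copying metadata onto each chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        chunks.append(Chunk(doc_id, source, published, index, " ".join(window)))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks


def filter_chunks(chunks, source=None, after=None, before=None):
    """Keep chunks matching an optional source and an inclusive date range."""
    return [
        c for c in chunks
        if (source is None or c.source == source)
        and (after is None or c.published >= after)
        and (before is None or c.published <= before)
    ]
```

Because every chunk carries its own `doc_id`, `source`, and `published` date, a filter can run before or after similarity search without going back to the original document, and the same fields can double as citation data.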
Still a work in progress, with more updates on the way. But it's miles better than duct-taping together PDFs, AI search engines, etc. and hoping to find the relevant context you need.
yorkeccak•1d ago