Processing PDFs in production usually involves stringing together brittle OCR heuristics. Recent multimodal embeddings (like ColModernVBERT or ColPali) let you skip OCR entirely and retrieve directly from visual layouts, so we wanted to measure whether the computational overhead is actually worth it.
The short answer: Transformer-based image pipelines won't be perfect for every use case, but they fix exactly what OCR breaks.
Here is what we found benchmarking 3,230 pages of dense scientific literature:
Complementary Bottlenecks: Text representations (BM25 + dense vectors) are highly efficient for exact lexical constraints (e.g., finding a specific acronym like "HyDE"). Conversely, image embeddings shine on spatial architecture diagrams and t-SNE plots where OCR serialization just turns into structural garbage.
Multimodal Hybrid Search: Because these failure modes are almost perfectly orthogonal, fusing the two signals gives you the best performance out of the box. By combining them, we pushed top-1 recall to 49% (beating text alone at 46%).
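One standard way to fuse two orthogonal rankers is Reciprocal Rank Fusion, which only needs ranked doc ids, not comparable scores. Here is a minimal sketch (the doc ids and the `k=60` constant are illustrative, not from our benchmark):

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document accumulates sum(1 / (k + rank)) over the lists it appears
    in, so pages ranked highly by either modality bubble to the top even
    when the two rankers disagree everywhere else.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["p12", "p7", "p3"]    # e.g. BM25 + dense text ranking
image_hits = ["p7", "p9", "p12"]   # e.g. late-interaction image ranking
fused = reciprocal_rank_fusion([text_hits, image_hits])
# "p7" wins: it is near the top of both lists.
```

Because RRF ignores raw score magnitudes, it sidesteps the calibration problem of mixing BM25 scores with cosine similarities.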
The Memory Constraint: Late-interaction image embeddings produce thousands of vectors per page, creating a massive storage bottleneck. To address this, we evaluated MUVERA encoding. Under the hood, it compresses multi-vector representations into a single fixed-dimensional encoding via SimHash bucketing, letting you use standard HNSW indexing without the paralyzing memory overhead.
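To make the SimHash bucketing concrete, here is a simplified sketch of the core idea: random hyperplanes hash each token vector into one of 2^k buckets, the vectors in each bucket are summed, and the concatenated bucket sums form one fixed-size vector whose dot product approximates the late-interaction score. (All shapes and the k=4 setting are illustrative; the real MUVERA adds repetitions, projections, and fill rules for empty buckets.)

```python
import numpy as np

def fixed_dim_encoding(vectors: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Compress a (num_tokens, dim) multi-vector set into one fixed vector.

    SimHash signs against k random hyperplanes assign each token vector to
    one of 2^k buckets; summing the vectors per bucket and concatenating
    the bucket sums yields a single (2^k * dim)-dimensional encoding.
    """
    k, dim = hyperplanes.shape[0], vectors.shape[1]
    signs = (vectors @ hyperplanes.T) > 0        # (num_tokens, k) sign bits
    buckets = signs @ (1 << np.arange(k))        # bucket id per token vector
    fde = np.zeros((2 ** k, dim))
    np.add.at(fde, buckets, vectors)             # sum token vectors per bucket
    return fde.ravel()

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 128))           # k=4 -> 16 buckets
page = rng.standard_normal((1030, 128))          # ~1k patch vectors per page
query = rng.standard_normal((20, 128))
score = fixed_dim_encoding(page, planes) @ fixed_dim_encoding(query, planes)
```

The page goes from 1030 x 128 floats down to a single 2048-dim vector, which is what makes it indexable with an ordinary HNSW graph.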
In practice, if you are building a RAG workflow today, text-based context still provides higher downstream utility for the actual generation step (0.82 vs 0.71 alignment). Instead of picking one modality and dealing with its blind spots, start with hybrid text search as a sensible default, and inject multi-vector image embeddings to catch the visual edge-cases.
We’ve open-sourced the benchmark and the evaluation recipes:
- Paper: https://arxiv.org/abs/2602.17687
- IRPAPERS dataset: huggingface.co/weaviate/IRPAPERS (also on GitHub: github.com/weaviate/IRPAPERS)
- Experimental code: github.com/weaviate/query-agent-benchmarking
Happy to answer any questions about the evaluation pipeline, the cold start problem of visual benchmarks, or the specific retrieval trade-offs we saw.