Processing PDFs in production usually involves stringing together brittle OCR heuristics. Recent multimodal embeddings (like ColModernVBERT or ColPali) let you skip OCR entirely and retrieve directly from visual layouts, so we wanted to measure whether the computational overhead is actually worth it.
The short answer: Transformer-based image pipelines won't be perfect for every use case, but they fix exactly what OCR breaks.
Here is what we found benchmarking 3,230 pages of dense scientific literature:
Complementary Bottlenecks: Text representations (BM25 + dense vectors) are highly efficient for exact lexical constraints (e.g., finding a specific acronym like "HyDE"). Conversely, image embeddings shine on spatial architecture diagrams and t-SNE plots where OCR serialization just turns into structural garbage.
Multimodal Hybrid Search: Because these failure modes are almost perfectly orthogonal, fusing the two signals gives you the best performance out of the box. By combining them, we pushed top-1 recall to 49% (beating text alone at 46%).
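One standard way to fuse two orthogonal rankers is Reciprocal Rank Fusion, which only needs ranked doc ids, not comparable scores. Here is a minimal sketch (the doc ids and the `k=60` constant are illustrative, not from our benchmark):

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document accumulates sum(1 / (k + rank)) over the lists it appears
    in, so pages ranked highly by either modality bubble to the top even
    when the two rankers disagree everywhere else.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["p12", "p7", "p3"]    # e.g. BM25 + dense text ranking
image_hits = ["p7", "p9", "p12"]   # e.g. late-interaction image ranking
fused = reciprocal_rank_fusion([text_hits, image_hits])
# "p7" wins: it is near the top of both lists.
```

Because RRF ignores raw score magnitudes, it sidesteps the calibration problem of mixing BM25 scores with cosine similarities.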
The Memory Constraint: Late-interaction image embeddings produce thousands of vectors per page, creating a massive storage bottleneck. To address this, we evaluated MUVERA encoding. Under the hood, it compresses multi-vector representations into a single fixed-dimensional encoding via SimHash bucketing, letting you use standard HNSW indexing without the paralyzing memory overhead.
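To make the SimHash bucketing concrete, here is a simplified sketch of the core idea: random hyperplanes hash each token vector into one of 2^k buckets, the vectors in each bucket are summed, and the concatenated bucket sums form one fixed-size vector whose dot product approximates the late-interaction score. (All shapes and the k=4 setting are illustrative; the real MUVERA adds repetitions, projections, and fill rules for empty buckets.)

```python
import numpy as np

def fixed_dim_encoding(vectors: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Compress a (num_tokens, dim) multi-vector set into one fixed vector.

    SimHash signs against k random hyperplanes assign each token vector to
    one of 2^k buckets; summing the vectors per bucket and concatenating
    the bucket sums yields a single (2^k * dim)-dimensional encoding.
    """
    k, dim = hyperplanes.shape[0], vectors.shape[1]
    signs = (vectors @ hyperplanes.T) > 0        # (num_tokens, k) sign bits
    buckets = signs @ (1 << np.arange(k))        # bucket id per token vector
    fde = np.zeros((2 ** k, dim))
    np.add.at(fde, buckets, vectors)             # sum token vectors per bucket
    return fde.ravel()

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 128))           # k=4 -> 16 buckets
page = rng.standard_normal((1030, 128))          # ~1k patch vectors per page
query = rng.standard_normal((20, 128))
score = fixed_dim_encoding(page, planes) @ fixed_dim_encoding(query, planes)
```

The page goes from 1030 x 128 floats down to a single 2048-dim vector, which is what makes it indexable with an ordinary HNSW graph.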
In practice, if you are building a RAG workflow today, text-based context still provides higher downstream utility for the actual generation step (0.82 vs 0.71 alignment). Instead of picking one modality and dealing with its blind spots, start with hybrid text search as a sensible default, and inject multi-vector image embeddings to catch the visual edge-cases.
We’ve open-sourced the benchmark and the evaluation recipes:
- Paper: https://arxiv.org/abs/2602.17687
- IRPAPERS dataset: huggingface.co/weaviate/IRPAPERS (also on GitHub: github.com/weaviate/IRPAPERS)
- Experimental code: github.com/weaviate/query-agent-benchmarking
Happy to answer any questions about the evaluation pipeline, the cold start problem of visual benchmarks, or the specific retrieval trade-offs we saw.