Show HN: Irpapers – Visual embeddings vs. OCR trade-offs in scientific PDFs

https://github.com/weaviate/query-agent-benchmarking
4•pvpv•1h ago
Hey HN, we are releasing IRPAPERS to answer a highly pragmatic question: when building a RAG pipeline over PDFs, should you OCR the text or just embed the raw page images?

Processing PDFs in production usually involves stringing together brittle OCR heuristics. Recent multimodal embedding models (like ColModernVBERT or ColPali) let you skip OCR entirely and retrieve directly from visual layouts, so we wanted to measure whether the computational overhead is actually worth the utility.

The short answer: Transformer-based image pipelines won't be perfect for every use-case, but they fix exactly what OCR breaks.

Here is what we found benchmarking 3,230 pages of dense scientific literature:

Complementary Bottlenecks: Text representations (BM25 + dense vectors) are highly efficient for exact lexical constraints (e.g., finding a specific acronym like "HyDE"). Conversely, image embeddings shine on spatial architecture diagrams and t-SNE plots where OCR serialization just turns into structural garbage.

Multimodal Hybrid Search: Because these failure modes are almost perfectly orthogonal, fusing the two signals gives you the best performance out of the box. By combining them, we pushed top-1 recall to 49% (beating text alone at 46%).
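To make the fusion step concrete, here is a minimal sketch of combining a text ranking and an image ranking with reciprocal rank fusion (RRF). This is an assumption for illustration: the post does not say which fusion method the benchmark uses, and the page IDs, the function name, and the constant k=60 are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page IDs into one ranking.

    Each page scores 1 / (k + rank) in every list it appears in,
    so pages ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical top-3 results from each retriever for one query.
text_hits = ["page_12", "page_7", "page_3"]
image_hits = ["page_7", "page_40", "page_12"]

fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused)  # ['page_7', 'page_12', 'page_40', 'page_3']
```

Rank-based fusion sidesteps the fact that BM25 scores and embedding similarities live on different scales, which is one reason it is a common default for hybrid search.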

The Memory Constraint: Late-interaction image embeddings produce thousands of vectors per page, creating a massive storage bottleneck. To address this, we evaluated MUVERA encoding. Under the hood, it compresses a page's multi-vector representation into a single fixed-dimensional encoding via SimHash, so you can use standard HNSW indexing without the paralyzing memory overhead.
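The core idea of turning many vectors into one fixed-size vector with SimHash can be sketched roughly as below. This is a simplified illustration, not the benchmark's code: the published MUVERA method includes further steps that are left out here, and all function names, the seed, and the toy 4-dimensional vectors are hypothetical.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplanes used as SimHash bits."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash_bucket(vec, hyperplanes):
    """Map a vector to a bucket index from the signs of its projections."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bucket = (bucket << 1) | (1 if dot > 0 else 0)
    return bucket


def fixed_dim_encoding(vectors, hyperplanes, average):
    """Aggregate vectors per bucket and concatenate all buckets.

    average=True takes the per-bucket mean (document side);
    average=False takes the per-bucket sum (query side).
    Empty buckets stay zero in this simplified sketch.
    """
    dim = len(vectors[0])
    num_buckets = 2 ** len(hyperplanes)
    sums = [[0.0] * dim for _ in range(num_buckets)]
    counts = [0] * num_buckets
    for vec in vectors:
        b = simhash_bucket(vec, hyperplanes)
        counts[b] += 1
        for i, v in enumerate(vec):
            sums[b][i] += v
    encoding = []
    for b in range(num_buckets):
        scale = 1.0 / counts[b] if (average and counts[b]) else 1.0
        encoding.extend(x * scale for x in sums[b])
    return encoding


planes = make_hyperplanes(dim=4, num_bits=3)
# Three toy patch vectors standing in for one page's multi-vector embedding.
page_vectors = [[0.1, 0.9, 0.0, 0.2], [0.8, 0.1, 0.3, 0.0], [0.2, 0.7, 0.1, 0.1]]
page_fde = fixed_dim_encoding(page_vectors, planes, average=True)
query_fde = fixed_dim_encoding(page_vectors, planes, average=False)
print(len(page_fde))  # 2**3 buckets * 4 dims = 32
```

The output length depends only on the number of hash bits and the vector dimension, not on how many vectors the page produced, which is what lets a single-vector index such as HNSW store it.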

In practice, if you are building a RAG workflow today, text-based context still provides higher downstream utility for the actual generation step (0.82 vs 0.71 alignment). Instead of picking one modality and living with its blind spots, start with hybrid text search as a sensible default, and add multi-vector image embeddings to catch the visual edge cases.

We’ve open-sourced the benchmark and the evaluation recipes:

Paper: https://arxiv.org/abs/2602.17687

IRPAPERS dataset on HuggingFace: huggingface.co/weaviate/IRPAPERS

IRPAPERS on GitHub: github.com/weaviate/IRPAPERS

Our experimental code is also available on GitHub at github.com/weaviate/query-agent-benchmarking

Happy to answer any questions about the evaluation pipeline, the cold start problem of visual benchmarks, or the specific retrieval trade-offs we saw.
