OCR and AI Pipeline over 2.7M Pages with Full-Text Search and Chat

https://epstein-file-explorer.com/
2•VibeCodingFG•1h ago

Comments

VibeCodingFG•1h ago
I vibecoded and built a document ingestion and search system over ~1.3M public PDFs (2.7M+ pages total).

The original goal was to extract structured information (people, places, relationships) via OCR + AI analysis. The pipeline looks roughly like this:

• Bulk ingest via torrent (aria2c)
• Normalize + upload to Cloudflare R2
• OCR and text extraction
• Media classification + AI prioritization
• AI analysis (DeepSeek) for entity extraction
• Load structured results into PostgreSQL
• Generate person↔document and person↔person relationships

At this scale, the bottleneck wasn’t storage; it was AI cost and throughput.

Running full AI enrichment over millions of pages is slow and compute-intensive. Because of that, only portions of the corpus were initially enriched with structured metadata.
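As a rough illustration of enriching only part of the corpus under a fixed budget (not the production code), the sketch below ranks pages with a cheap heuristic and picks the best ones that fit a token budget. The names `priority_score` and `pick_batch`, the capitalised-word heuristic, and the 4-characters-per-token estimate are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PageJob:
    # heapq is a min-heap, so the score is stored negated to pop the best page first.
    neg_score: float
    page_id: str = field(compare=False)
    est_tokens: int = field(compare=False)


def priority_score(text: str) -> float:
    """Hypothetical heuristic: share of words that start with a capital letter."""
    words = text.split()
    if not words:
        return 0.0
    caps = sum(1 for w in words if w[:1].isupper())
    return caps / len(words)


def pick_batch(pages: dict[str, str], token_budget: int) -> list[str]:
    """Return page ids to enrich next, best score first, within the token budget."""
    heap = [
        # Assumed cost model: roughly 4 characters per token, at least 1 token.
        PageJob(-priority_score(text), page_id, max(1, len(text) // 4))
        for page_id, text in pages.items()
    ]
    heapq.heapify(heap)
    chosen, spent = [], 0
    while heap:
        job = heapq.heappop(heap)
        if spent + job.est_tokens > token_budget:
            continue  # too expensive for what is left; a cheaper page may still fit
        chosen.append(job.page_id)
        spent += job.est_tokens
    return chosen
```

Pages that do not make the cut stay searchable through the full-text index and can be picked up in a later batch.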

To avoid discoverability being limited by enrichment progress, I added:

• Full-text search across all 2.7M+ pages
• Page-level deep linking into source documents
• A conversational “Ask the Archive” feature that retrieves from the indexed corpus
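To make the page-level idea concrete, here is a minimal in-memory sketch in which the index stores `(doc_id, page_no)` pairs rather than whole documents, so every hit can be turned into a deep link. This is a toy inverted index, not the engine behind the site; the `PageIndex` class and the `/doc/<id>#page=<n>` URL scheme are assumptions for illustration.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


class PageIndex:
    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        # term -> set of (doc_id, page_no) pairs containing that term
        self.postings = defaultdict(set)

    def add_page(self, doc_id, page_no, text):
        for term in set(tokenize(text)):
            self.postings[term].add((doc_id, page_no))

    def search(self, query):
        """Return pages containing every query term (AND semantics), sorted."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

    def deep_link(self, doc_id, page_no):
        # Hypothetical URL scheme; the real site may link pages differently.
        return f"{self.base_url}/doc/{doc_id}#page={page_no}"
```

A retrieval-based chat feature could reuse the same page-level hits as its context, which keeps every answer traceable to a specific page.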

The architecture today is:

Ingestion → OCR → Indexed text store → AI enrichment (incremental) → Postgres for relationships → Search + Graph + Chat layer

Some of the more interesting challenges:

• Entity collision from messy OCR output
• Preventing high-degree “hub” entities from polluting graph queries
• Incremental reprocessing when improving extraction
• Balancing precomputed graph edges vs. query-time joins
• Handling burst traffic (20–55k visits/day)
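For the first item, one common approach (sketched here under assumptions, not taken from the actual pipeline) is to normalise extracted names and then fold near-duplicates into a canonical form with a string-similarity threshold. The helper names and the 0.85 threshold are illustrative; the loop is quadratic, so a real run over millions of mentions would need blocking or an index first.

```python
import difflib
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Strip accents and punctuation, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z ]+", " ", text)
    return " ".join(text.lower().split())


def canonicalize(names, threshold=0.85):
    """Map each raw name to the first previously seen name it closely resembles."""
    canon = []    # canonical normalized forms, in first-seen order
    mapping = {}  # raw name -> canonical normalized form
    for raw in names:
        key = normalize_name(raw)
        match = next(
            (c for c in canon
             if difflib.SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canon.append(key)
            match = key
        mapping[raw] = match
    return mapping
```

A threshold like this also merges genuinely different people with similar names, so merged clusters are better treated as candidates for review than as ground truth.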

I’d appreciate feedback on:

1. Whether moving relationship storage to a graph-native DB would make sense long-term
2. Better strategies for incremental AI enrichment at this scale
3. Techniques to reduce noisy edge generation in large document graphs
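To make the third question concrete, here is one candidate technique as a sketch: score person↔person co-occurrence edges with pointwise mutual information (PMI) and skip entities that appear in too large a share of documents. This is a generic approach, not what the site currently does; `score_edges`, `max_doc_fraction`, and `min_score` are hypothetical names and defaults.

```python
import math
from collections import Counter
from itertools import combinations


def score_edges(doc_entities, min_score=0.0, max_doc_fraction=0.5):
    """Return {(a, b): pmi} for entity pairs that co-occur more than chance predicts.

    doc_entities maps a document id to the entities extracted from it.
    """
    n_docs = len(doc_entities)
    doc_freq = Counter(e for ents in doc_entities.values() for e in set(ents))
    pair_freq = Counter()
    for ents in doc_entities.values():
        for a, b in combinations(sorted(set(ents)), 2):
            pair_freq[(a, b)] += 1

    edges = {}
    for (a, b), n_ab in pair_freq.items():
        # Hub suppression: skip entities present in too many documents.
        if max(doc_freq[a], doc_freq[b]) / n_docs > max_doc_fraction:
            continue
        # PMI = log( P(a, b) / (P(a) * P(b)) ), using document-level counts.
        pmi = math.log((n_ab * n_docs) / (doc_freq[a] * doc_freq[b]))
        if pmi > min_score:
            edges[(a, b)] = pmi
    return edges
```

PMI tends to overrate pairs seen only once or twice, so a minimum co-occurrence count is a sensible extra filter on a corpus of this size.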

Happy to answer technical questions.
