I vibecoded and built a document ingestion and search system over ~1.3M public PDFs (2.7M+ pages total).
The original goal was to extract structured information (people, places, relationships) via OCR + AI analysis. The pipeline looks roughly like this:
• Bulk ingest via torrent (aria2c)
• Normalize + upload to Cloudflare R2
• OCR and text extraction
• Media classification + AI prioritization
• AI analysis (DeepSeek) for entity extraction
• Load structured results into PostgreSQL
• Generate person↔document and person↔person relationships
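For that last step, one straightforward way to materialize person↔person edges is as document co-occurrence counts over the person↔document links already in Postgres. A minimal sketch, with illustrative table and column names (person_document, person_person) rather than my exact schema:

    # Sketch: derive person<->person edges as document co-occurrence counts
    # from a person<->document link table. Table/column names are illustrative.
    import psycopg2

    EDGE_SQL = """
    INSERT INTO person_person (person_a, person_b, doc_count)
    SELECT a.person_id, b.person_id, COUNT(DISTINCT a.document_id)
    FROM person_document a
    JOIN person_document b
      ON a.document_id = b.document_id
     AND a.person_id < b.person_id      -- one row per unordered pair, no self-edges
    GROUP BY a.person_id, b.person_id
    ON CONFLICT (person_a, person_b)    -- requires a unique constraint on the pair
    DO UPDATE SET doc_count = EXCLUDED.doc_count;
    """

    def rebuild_person_edges(dsn: str) -> None:
        """Recompute co-occurrence edges, e.g. after a batch of enrichment lands."""
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:  # the outer `with conn` commits on success
                cur.execute(EDGE_SQL)
        finally:
            conn.close()

Whether edges like this stay precomputed or get joined at query time is one of the trade-offs listed further down.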
At this scale, the bottleneck wasn’t storage — it was AI cost and throughput.
Running full AI enrichment over millions of pages is slow and compute-intensive. Because of that, only portions of the corpus were initially enriched with structured metadata.
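One pattern that keeps enrichment incremental without re-paying for pages already processed: track enrichment status per page and let workers claim prioritized batches. A minimal sketch, assuming a pages table with enrichment_status and priority columns (names and flow are illustrative, not my exact setup):

    # Claim-a-batch pattern for incremental enrichment; column names are illustrative.
    CLAIM_BATCH_SQL = """
    UPDATE pages
    SET enrichment_status = 'in_progress'
    WHERE id IN (
        SELECT id
        FROM pages
        WHERE enrichment_status = 'pending'
        ORDER BY priority DESC
        LIMIT %s
        FOR UPDATE SKIP LOCKED      -- lets many workers claim batches concurrently
    )
    RETURNING id, ocr_text;
    """

    def claim_pages(conn, batch_size: int = 100):
        """Claim the next highest-priority batch of pages awaiting enrichment."""
        with conn, conn.cursor() as cur:  # commits the status change on success
            cur.execute(CLAIM_BATCH_SQL, (batch_size,))
            return cur.fetchall()

Reprocessing with an improved extractor then mostly amounts to flipping the affected pages back to 'pending'.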
To avoid discoverability being limited by enrichment progress, I added:
• Full-text search across all 2.7M+ pages
• Page-level deep linking into source documents
• A conversational “Ask the Archive” feature that retrieves from the indexed corpus
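To give a sense of the retrieval behind these: page-level full-text search is enough to back both the search box and the retrieval step of “Ask the Archive”. A sketch assuming the page index is a Postgres tsvector column (the same idea applies to any text index; names and the URL scheme are illustrative):

    # Page-level retrieval sketch, assuming a tsvector column (text_tsv) on a
    # pages table; all names and the deep-link format are illustrative.
    SEARCH_SQL = """
    SELECT document_id,
           page_number,
           ts_headline('english', ocr_text, q) AS snippet,
           ts_rank(text_tsv, q) AS rank
    FROM pages, websearch_to_tsquery('english', %s) AS q
    WHERE text_tsv @@ q
    ORDER BY rank DESC
    LIMIT %s;
    """

    def search_pages(conn, query: str, limit: int = 20):
        """Return page-level hits with a deep link into the source document."""
        with conn.cursor() as cur:
            cur.execute(SEARCH_SQL, (query, limit))
            return [
                {
                    "document_id": doc_id,
                    "page": page,
                    "snippet": snippet,
                    "url": f"/doc/{doc_id}#page={page}",  # illustrative deep-link format
                }
                for doc_id, page, snippet, _rank in cur.fetchall()
            ]

Page-level deep linking falls out of indexing per page rather than per document, and the chat flow is essentially the same retrieval with the top snippets passed to the model as context.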
The architecture today is:
Ingestion → OCR → Indexed text store → AI enrichment (incremental) → Postgres for relationships → Search + Graph + Chat layer
Some of the more interesting challenges:
• Entity collision from messy OCR output
• Preventing high-degree “hub” entities from polluting graph queries (one mitigation sketched after this list)
• Incremental reprocessing when improving extraction
• Balancing precomputed graph edges vs. query-time joins
• Handling burst traffic (20–55k visits/day)
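On the hub issue: one mitigation is to precompute each entity's degree, then down-weight and cap high-degree neighbors at query time instead of letting them dominate. A sketch over the same illustrative tables as above; the threshold is an assumption:

    # Hub damping for neighbor queries: precompute entity degree, then
    # down-weight and cut off very high-degree neighbors.
    DEGREE_DDL = """
    CREATE MATERIALIZED VIEW IF NOT EXISTS person_degree AS
    SELECT person_id, COUNT(*) AS degree
    FROM (
        SELECT person_a AS person_id FROM person_person
        UNION ALL
        SELECT person_b AS person_id FROM person_person
    ) endpoints
    GROUP BY person_id;
    """

    NEIGHBOR_SQL = """
    SELECT nb.neighbor_id,
           nb.doc_count::float / ln(1 + d.degree) AS weight  -- damp hub neighbors
    FROM (
        SELECT person_b AS neighbor_id, doc_count
        FROM person_person WHERE person_a = %(pid)s
        UNION ALL
        SELECT person_a AS neighbor_id, doc_count
        FROM person_person WHERE person_b = %(pid)s
    ) nb
    JOIN person_degree d ON d.person_id = nb.neighbor_id
    WHERE d.degree < %(max_degree)s      -- drop extreme hubs outright
    ORDER BY weight DESC
    LIMIT %(limit)s;
    """

    def neighbors(conn, person_id: int, max_degree: int = 5000, limit: int = 50):
        """Top co-occurring people for one person, with hub entities damped."""
        with conn.cursor() as cur:
            cur.execute(NEIGHBOR_SQL, {"pid": person_id,
                                       "max_degree": max_degree,
                                       "limit": limit})
            return cur.fetchall()

The same degree stats could also drive edge pruning at write time, which is part of what question 3 below is getting at.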
I’d appreciate feedback on:
1. Whether moving relationship storage to a graph-native DB would make sense long-term
2. Better strategies for incremental AI enrichment at this scale
3. Techniques to reduce noisy edge generation in large document graphs
Happy to answer technical questions.