I vibecoded and built a document ingestion and search system over ~1.3M public PDFs (2.7M+ pages total).
The original goal was to extract structured information (people, places, relationships) via OCR + AI analysis. The pipeline looks roughly like this:
• Bulk ingest via torrent (aria2c)
• Normalize + upload to Cloudflare R2
• OCR and text extraction
• Media classification + AI prioritization
• AI analysis (DeepSeek) for entity extraction
• Load structured results into PostgreSQL
• Generate person↔document and person↔person relationships
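For that last step, one straightforward way to materialize person↔person edges is as document co-occurrence counts over the person↔document links already in Postgres. A minimal sketch, with illustrative table and column names (person_document, person_person) rather than my exact schema:

    # Sketch: derive person<->person edges as document co-occurrence counts
    # from a person<->document link table. Table/column names are illustrative.
    import psycopg2

    EDGE_SQL = """
    INSERT INTO person_person (person_a, person_b, doc_count)
    SELECT a.person_id, b.person_id, COUNT(DISTINCT a.document_id)
    FROM person_document a
    JOIN person_document b
      ON a.document_id = b.document_id
     AND a.person_id < b.person_id      -- one row per unordered pair, no self-edges
    GROUP BY a.person_id, b.person_id
    ON CONFLICT (person_a, person_b)    -- requires a unique constraint on the pair
    DO UPDATE SET doc_count = EXCLUDED.doc_count;
    """

    def rebuild_person_edges(dsn: str) -> None:
        """Recompute co-occurrence edges, e.g. after a batch of enrichment lands."""
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:  # the outer `with conn` commits on success
                cur.execute(EDGE_SQL)
        finally:
            conn.close()

Whether edges like this stay precomputed or get joined at query time is one of the trade-offs listed further down.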
At this scale, the bottleneck wasn’t storage — it was AI cost and throughput.
Running full AI enrichment over millions of pages is slow and compute-intensive. Because of that, only portions of the corpus were initially enriched with structured metadata.
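One pattern that keeps enrichment incremental without re-paying for pages already processed: track enrichment status per page and let workers claim prioritized batches. A minimal sketch, assuming a pages table with enrichment_status and priority columns (names and flow are illustrative, not my exact setup):

    # Claim-a-batch pattern for incremental enrichment; column names are illustrative.
    CLAIM_BATCH_SQL = """
    UPDATE pages
    SET enrichment_status = 'in_progress'
    WHERE id IN (
        SELECT id
        FROM pages
        WHERE enrichment_status = 'pending'
        ORDER BY priority DESC
        LIMIT %s
        FOR UPDATE SKIP LOCKED      -- lets many workers claim batches concurrently
    )
    RETURNING id, ocr_text;
    """

    def claim_pages(conn, batch_size: int = 100):
        """Claim the next highest-priority batch of pages awaiting enrichment."""
        with conn, conn.cursor() as cur:  # commits the status change on success
            cur.execute(CLAIM_BATCH_SQL, (batch_size,))
            return cur.fetchall()

Reprocessing with an improved extractor then mostly amounts to flipping the affected pages back to 'pending'.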
To avoid discoverability being limited by enrichment progress, I added:
• Full-text search across all 2.7M+ pages
• Page-level deep linking into source documents
• A conversational “Ask the Archive” feature that retrieves from the indexed corpus
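To give a sense of the retrieval behind these: page-level full-text search is enough to back both the search box and the retrieval step of “Ask the Archive”. A sketch assuming the page index is a Postgres tsvector column (the same idea applies to any text index; names and the URL scheme are illustrative):

    # Page-level retrieval sketch, assuming a tsvector column (text_tsv) on a
    # pages table; all names and the deep-link format are illustrative.
    SEARCH_SQL = """
    SELECT document_id,
           page_number,
           ts_headline('english', ocr_text, q) AS snippet,
           ts_rank(text_tsv, q) AS rank
    FROM pages, websearch_to_tsquery('english', %s) AS q
    WHERE text_tsv @@ q
    ORDER BY rank DESC
    LIMIT %s;
    """

    def search_pages(conn, query: str, limit: int = 20):
        """Return page-level hits with a deep link into the source document."""
        with conn.cursor() as cur:
            cur.execute(SEARCH_SQL, (query, limit))
            return [
                {
                    "document_id": doc_id,
                    "page": page,
                    "snippet": snippet,
                    "url": f"/doc/{doc_id}#page={page}",  # illustrative deep-link format
                }
                for doc_id, page, snippet, _rank in cur.fetchall()
            ]

Page-level deep linking falls out of indexing per page rather than per document, and the chat flow is essentially the same retrieval with the top snippets passed to the model as context.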
The architecture today is:
Ingestion → OCR → Indexed text store → AI enrichment (incremental) → Postgres for relationships → Search + Graph + Chat layer
Some of the more interesting challenges:
• Entity collision from messy OCR output
• Preventing high-degree “hub” entities from polluting graph queries (one mitigation sketched after this list)
• Incremental reprocessing when improving extraction
• Balancing precomputed graph edges vs. query-time joins
• Handling burst traffic (20–55k visits/day)
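On the hub issue: one mitigation is to precompute each entity's degree, then down-weight and cap high-degree neighbors at query time instead of letting them dominate. A sketch over the same illustrative tables as above; the threshold is an assumption:

    # Hub damping for neighbor queries: precompute entity degree, then
    # down-weight and cut off very high-degree neighbors.
    DEGREE_DDL = """
    CREATE MATERIALIZED VIEW IF NOT EXISTS person_degree AS
    SELECT person_id, COUNT(*) AS degree
    FROM (
        SELECT person_a AS person_id FROM person_person
        UNION ALL
        SELECT person_b AS person_id FROM person_person
    ) endpoints
    GROUP BY person_id;
    """

    NEIGHBOR_SQL = """
    SELECT nb.neighbor_id,
           nb.doc_count::float / ln(1 + d.degree) AS weight  -- damp hub neighbors
    FROM (
        SELECT person_b AS neighbor_id, doc_count
        FROM person_person WHERE person_a = %(pid)s
        UNION ALL
        SELECT person_a AS neighbor_id, doc_count
        FROM person_person WHERE person_b = %(pid)s
    ) nb
    JOIN person_degree d ON d.person_id = nb.neighbor_id
    WHERE d.degree < %(max_degree)s      -- drop extreme hubs outright
    ORDER BY weight DESC
    LIMIT %(limit)s;
    """

    def neighbors(conn, person_id: int, max_degree: int = 5000, limit: int = 50):
        """Top co-occurring people for one person, with hub entities damped."""
        with conn.cursor() as cur:
            cur.execute(NEIGHBOR_SQL, {"pid": person_id,
                                       "max_degree": max_degree,
                                       "limit": limit})
            return cur.fetchall()

The same degree stats could also drive edge pruning at write time, which is part of what question 3 below is getting at.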
I’d appreciate feedback on:
1. Whether moving relationship storage to a graph-native DB would make sense long-term
2. Better strategies for incremental AI enrichment at this scale
3. Techniques to reduce noisy edge generation in large document graphs
Happy to answer technical questions.