Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned-image PDFs with no text layer. That made them impossible to search with Ctrl+F or to analyze programmatically.
I built a pipeline to fix this using Python, Tesseract, and OpenSearch.
The Site: https://epsteinfilez.com
The Stack:
Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.
Search: OpenSearch for indexing the extracted text.
Frontend: Next.js (SSR) for the UI.
Infrastructure: Self-hosted Docker Swarm.
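For anyone curious what the ingestion step can look like, here is a minimal sketch, not the site's actual worker code. It assumes the `ocrmypdf` CLI is on the PATH and uses its `--skip-text`, `--jobs`, and `--sidecar` options. The function names, directory layout, and worker count are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_ocr_command(src: Path, out_dir: Path) -> list[str]:
    """Build an ocrmypdf call that adds a text layer and writes a sidecar .txt."""
    return [
        "ocrmypdf",
        "--skip-text",   # leave pages that already contain text untouched
        "--jobs", "1",   # one core per file; parallelism comes from the pool below
        "--sidecar", str(out_dir / (src.stem + ".txt")),
        str(src),                   # input: raw scanned PDF
        str(out_dir / src.name),    # output: searchable PDF
    ]


def ocr_one(src: Path, out_dir: Path, run=subprocess.run) -> Path:
    """OCR a single file and return the path of its extracted-text sidecar."""
    run(build_ocr_command(src, out_dir), check=True)
    return out_dir / (src.stem + ".txt")


def ocr_all(raw_dir: Path, out_dir: Path, workers: int = 4, run=subprocess.run) -> list[Path]:
    """OCR every PDF in raw_dir in parallel (hypothetical directory layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(raw_dir.glob("*.pdf"))
    # Threads are enough here: the heavy work happens in the ocrmypdf subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_one(p, out_dir, run), pdfs))
```

Calling `ocr_all(Path("raw"), Path("ocr"))` would leave one searchable PDF and one `.txt` file per input. The `.txt` sidecars are what would then be sent to OpenSearch for indexing.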
Features:
Sub-second full-text search across all files.
Search-term highlighting directly on the PDF page.
Deep linking to specific pages/documents.
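The search and deep-linking features could be wired up roughly like this. It is a sketch under assumptions, not the production code: it assumes one indexed entry per PDF page with hypothetical field names `doc_id`, `page`, and `text`, and a hypothetical `/doc/<id>?page=N` URL layout. The query body uses the standard OpenSearch `match` query and `highlight` options.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://epsteinfilez.com"  # from the post; the path layout below is hypothetical


def page_document(doc_id: str, page: int, text: str) -> dict:
    """One index entry per PDF page, so a hit can point at an exact page."""
    return {"doc_id": doc_id, "page": page, "text": text}


def search_body(terms: str, size: int = 10) -> dict:
    """Request body for a full-text search with highlighted snippets."""
    return {
        "size": size,
        # Require all terms to appear in the page text.
        "query": {"match": {"text": {"query": terms, "operator": "and"}}},
        "highlight": {
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
            "fields": {"text": {"fragment_size": 150, "number_of_fragments": 3}},
        },
    }


def deep_link(doc_id: str, page: int, terms: str = "") -> str:
    """Build a shareable link to one page of one document."""
    params = {"page": page}
    if terms:
        params["q"] = terms  # lets the viewer re-highlight the same terms
    return f"{BASE_URL}/doc/{quote(doc_id, safe='')}?{urlencode(params)}"
```

The dict from `search_body("flight log")` can be posted to an index's `_search` endpoint, and each hit's `doc_id` and `page` then feed `deep_link`. Note that this only yields highlighted text snippets. Drawing highlights on the rendered PDF page itself would also need word positions from the OCR output, which this sketch does not cover.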
This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists.
Feedback on the search relevance or indexing pipeline is welcome!