Show HN: RAG-powered search tool for 20k+ Epstein files

3•benbaessler•1mo ago

I built epfiles.ai to make the U.S. House Oversight Epstein document release actually searchable.

The files exist publicly, but they're scattered across nested Google Drive folders in mixed formats: PDFs, images, scanned documents. Manually searching through 20,000+ files is impractical for most people.

This tool lets you query the corpus in natural language. Every answer includes clickable citations to the exact source page, so you can verify against the original. The goal is document discovery, not replacing human verification.

Technical approach:
- OCR'd the entire corpus
- Chunked and embedded for semantic search
- RAG pipeline returns relevant passages with source links
- Citations point directly to the House Oversight Committee's Google Drive
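To make the chunk, embed, and cite steps concrete, here is a minimal sketch, not the actual epfiles.ai code. It assumes page text has already been OCR'd, and it swaps in a bag-of-words cosine score as a stand-in for a real embedding model. The `Chunk` fields, the `chunk_page` and `search` helpers, the window sizes, and the `#page=` citation format are all hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str      # hypothetical document identifier
    page: int        # 1-based page the text came from
    text: str        # OCR'd text of this window
    source_url: str  # hypothetical link to the original file


def chunk_page(doc_id, page, text, source_url, size=40, overlap=10):
    """Split one page of OCR'd text into overlapping word windows."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        window = words[start:start + size]
        if window:
            chunks.append(Chunk(doc_id, page, " ".join(window), source_url))
        if start + size >= len(words):
            break  # the last window already reached the end of the page
    return chunks


def embed(text):
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, k=3):
    """Return the top-k passages, each with a page-level citation."""
    q = embed(query)
    scored = [(cosine(q, embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": s, "text": c.text, "citation": f"{c.source_url}#page={c.page}"}
        for s, c in scored[:k]
    ]
```

Keeping the document ID and page number on every chunk is what lets each retrieved passage carry a link back to its source page, so the reader can check the original rather than trust the passage alone.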

I built this because I think public document releases should be usable, not just technically available. Happy to answer questions about the approach.

Demo: https://youtube.com/watch?v=7sQgRvwK3LE

Comments

N_Lens•1mo ago

If I had more time, I'd love to pose some quick questions that collate the relevant data from the files, such as the contents linking DJT or any other high-profile individual.

benbaessler•1mo ago

Yes, that should definitely be doable. Let me know if you need help with anything.

benbaessler•1mo ago

Here's what it does:
- Natural-language search over the full corpus
- Results are document discovery, not “trust me” summaries
- Every result includes clickable citations that jump to the exact source page in the committee’s Google Drive so you can verify context quickly

Some useful test queries:
- “Find documents mentioning [person/org] in connection with flights / schedules / contacts”
- “Show mentions of ‘massage’, ‘modeling’, ‘recruiting’, ‘Palm Beach’, ‘New York’, ‘Little St. James’”
- “What documents reference [date range] and [location]”

Known limitations:
- OCR noise on low-quality scans
- Names/aliases can be inconsistent; citations are the ground truth
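One common way to soften the OCR-noise and inconsistent-spelling problem is fuzzy matching against a list of known terms. The sketch below is an illustration only, not something the tool is confirmed to do: `normalize`, `match_name`, and the 0.8 cutoff are hypothetical, and it relies on Python's standard `difflib.get_close_matches`.

```python
import difflib
import re


def normalize(name):
    """Lowercase and drop anything that is not a letter or a space."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()


def match_name(ocr_token, known_names, cutoff=0.8):
    """Map a noisy OCR string to the closest known name, or None.

    The cutoff is an assumed value; a stricter one trades recall for
    fewer false matches.
    """
    lookup = {normalize(n): n for n in known_names}
    hits = difflib.get_close_matches(
        normalize(ocr_token), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


known = ["Palm Beach", "New York"]
print(match_name("Pa1m Beach", known))  # digit misread by OCR in place of a letter
print(match_name("Chicago", known))     # no close match in the known list
```

A fuzzy match like this is only a discovery aid; the cited source page remains the thing to check.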
