Comprehensive searchable database of Epstein documents

3•jfyi•1h ago

Comments

jfyi•1h ago

I'm not associated with the project. I just think they are doing amazing work as of the recent document drops.

u1hcw9nx•1h ago

Just because the site says comprehensive does not mean it is comprehensive. Several names that other databases surface are missing here. Start from Joscha Bach...

DOJ has more comprehensive search functionality.

https://www.justice.gov/epstein/search

jfyi•46m ago

https://board.epsteinexposed.com/c/site-feedback/2

randlet•49m ago

/r/epstein post from the creator:

https://reddit.com/r/Epstein/comments/1r3joqr/i_mapped_every...

-------

A week ago I posted about an open database I’ve been building to cross reference Epstein case material. That post did way better than I expected (568k views, 4.6k upvotes) and it hugged my server to death twice.

Since then I basically did nothing but ingest, clean, and index more data. The database is now big enough that “just read the docs” is not advice, it’s a cry for help.

What it was last week

    ~6,000 documents
    1,708 flights
    2,700 emails
    1,438 people

What it is now

    1,522,060 documents (all DOJ releases we have access to so far), full text searchable
    1,708 flights (1997 to 2019) with manifests where available
    10,000+ emails indexed with threading
    1,350 people (cleaned: removed duplicates + nuked a bunch of false connections)
    638,000 docs run through redaction analysis
        ~1.8M individual redactions detected
        ~616k flagged by our tooling as “looks questionable, take a closer look”
        ~39,500 pages of text recovered from under black bars (you can see examples on the site)
    107,000 named entities pulled out via NLP (people, orgs, places, dates)
    1,530 audio/video transcripts
    4,300+ photos/media (raid photos, exhibits, property shots, government releases)

That’s not a typo: 1.5 million documents. If you search a phrase, it searches inside the actual pages (OCR where needed) and email bodies, not just titles.
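As a rough illustration of what searching inside page text (rather than titles) involves, here is a minimal inverted-index sketch. It is not the site's implementation: the page IDs, sample text, and function names are hypothetical, and it matches pages containing all query words rather than doing true phrase matching.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def build_index(pages):
    """Map each lowercase token to the set of page IDs whose body contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(page_id)
    return index

def search(index, query):
    """Return page IDs containing every token in the query (AND semantics)."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical page bodies (e.g. OCR output or email text), keyed by page ID.
PAGES = {
    "doc-1/p1": "Flight manifest for the 1997 trip",
    "doc-2/p4": "Email body mentions the manifest",
    "doc-3/p2": "Unrelated title page",
}
INDEX = build_index(PAGES)
hits = search(INDEX, "flight manifest")
```

A real deployment at 1.5M documents would more likely lean on a search engine or a database full-text extension; the sketch only shows why body text has to be tokenized and indexed before a query can reach it.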

So what changed, besides “everything is bigger”?

1) The redaction stuff is getting hard to ignore

I’m not saying “every redaction is evil.” Some of them obviously protect victims, minors, addresses, etc. But the patterns are weird, and the volume is insane.

I also worked with u/Sea_Doughnut_8853, who independently processed 519k PDFs with their own pipeline. That let us sanity check a lot of what we’re seeing across the corpus.

We’re flagging ~616k redactions as “potentially improper” based on patterns (context, repetition, surrounding text). That does not mean “definitely corrupt.” It means “this is the pile worth human eyes.”

We also recovered a lot of hidden text. If you want to judge it yourself, the doc pages show the redaction density and any recovered text we can reliably extract.

2) Entity extraction is the only way to deal with this scale

107,000 entities means you can stop playing whack a mole with PDFs. It’s still not “truth,” it’s just structure. But structure beats drowning.

3) This week’s real world developments are in there too

If you missed the news cycle, Congress has been pressuring DOJ about redactions, and Rep. Ro Khanna read six previously redacted names on the House floor:

    Leslie Wexner
    Salvatore Nuara
    Zurab Mikeladze
    Leonic Leonov
    Nicola Caputo
    Sultan Ahmed bin Sulayem

Important caveat: being named in a document is not proof of wrongdoing. People show up in emails, contact lists, forwarded threads, or because someone mentioned them.

    Reporting says Wexner’s name appeared in an internal FBI document as “co conspirator,” but he has not been charged.
    Maxwell invoked the Fifth in a House Oversight deposition and her lawyer floated testimony in exchange for clemency.
    House Oversight depositions are scheduled: Wexner (Feb 18), Richard Kahn (Feb 25), Darren Indyke (Mar 5), plus Hillary Clinton (Feb 26) and Bill Clinton (Feb 27).

All of those items are indexed, with the underlying documents linked where available.

New tools since last week

    Full text search: search inside 1.5M documents, 28k OCR entries, and 10k emails
    AI research assistant: ask a question in plain English, get an answer with citations back to the source docs so you can verify it yourself
    Degrees of separation: shortest documented path between two people, with the supporting flights/docs shown at each hop
    Redaction analysis on every doc page: how heavy, what got flagged, what got recovered
    Investigation Dossiers (new today): community made evidence boards
        pin any person/doc/flight/email
        add notes
        upvotes + comments
        “community notes” style fact checks
        sorting like hot/new/top
        I put up 14 starter dossiers so it’s not an empty ghost town
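The “degrees of separation” tool in the list above is, in graph terms, a shortest-path query. A minimal sketch of that idea, assuming people are nodes and each edge carries the ID of one supporting flight or document; the names, IDs, and function below are hypothetical and not taken from the site:

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph.

    Returns a list of (person, evidence) hops from start to goal, where
    evidence is the source linking that person to the previous hop
    (None for the starting person), or None if no path exists.
    """
    graph = {}
    for a, b, evidence in edges:
        graph.setdefault(a, []).append((b, evidence))
        graph.setdefault(b, []).append((a, evidence))
    if start == goal:
        return [(start, None)]
    parent = {start: None}  # person -> (previous person, evidence)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, evidence in graph.get(node, []):
            if neighbor in parent:
                continue  # already reached by a path at least as short
            parent[neighbor] = (node, evidence)
            if neighbor == goal:
                # Walk the parent links back to the start, then reverse.
                path, current = [], goal
                while parent[current] is not None:
                    previous, link = parent[current]
                    path.append((current, link))
                    current = previous
                path.append((start, None))
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical edges: (person, person, supporting source ID).
EDGES = [
    ("Person A", "Person B", "flight-0001"),
    ("Person B", "Person C", "doc-0042"),
    ("Person A", "Person D", "email-0007"),
]
path = shortest_path(EDGES, "Person A", "Person C")
```

Breadth-first search returns a path with the fewest hops, and keeping the evidence ID on each edge is what lets every hop be checked against a source. As the post itself stresses, a path only shows documented co-occurrence, not wrongdoing.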

What still bugs me

The government didn’t just withhold whole documents. In a lot of places, it looks like they blacked out specific names or transactions inside documents they did release. Maybe there are legit reasons for some of it. But at this volume, it needs scrutiny.

Also, the 2013 to 2019 passenger manifest gap is still a thing in the public record. Tons of flights, but not the corresponding names.

The database

Everything is at EpsteinExposed.com. Free. No ads. No paywall. You can browse without logging in. Accounts are only for making dossiers and posting notes.

There’s also a community forum for collab research: https://board.epsteinexposed.com

If you find errors, call them out. If you want a specific thread turned into a dossier, say the name and I’ll help you get it set up.

TL;DR

The database went from ~6k docs to 1.5M in a week. Full text searchable. We ran redaction analysis at scale, flagged a huge pile for human review, recovered a lot of hidden text, and the current Congress/DOJ redaction fight is now fully indexed in the same place.

Update:

I went to sleep thinking this would be a normal update post and woke up to it hitting r/popular / r/all.

Thank you. Seriously.

In ~4 hours this hit ~750k views and people have already donated ~$800. That is wild, and it genuinely helps keep the lights on while I keep ingesting and cleaning data. Everything goes toward making the site better!

A quick housekeeping thing because it needs to be said on posts like this:

Being named in a document is not proof of wrongdoing. People show up in emails, contact lists, forwarded threads, or because someone mentioned them.

Please don’t dox, harass, or post “I found their address” type stuff. If you want this taken seriously by journalists and agencies, it has to stay clean and source-based.

If you spot bad OCR, duplicates, broken links, or a false connection, call it out. That kind of boring cleanup work is how this gets stronger.

If you want to help, the best thing is still commenting and sharing. Second best is reporting errors or building a dossier on a specific thread so the research is organized and verifiable.

Also, small but important technical update: Semantic / Smart search is going live soon. Keyword search is great, but it misses anything that is phrased differently. Smart search uses a hybrid approach so you can search meaning, not just exact words. It’s already wired up; I’m generating the embeddings now and will seed them into the database next.
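The post does not say how the hybrid approach combines keyword and semantic results. One common way to merge two ranked lists is reciprocal rank fusion, sketched below purely as an assumption; the document IDs, the function name, and the constant k=60 are illustrative and not the site's method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    Ties are broken by document ID to keep the output deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results: one list from keyword search, one from embedding search.
keyword_hits = ["doc-3", "doc-1", "doc-7"]
semantic_hits = ["doc-1", "doc-9", "doc-3"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Here `doc-1` and `doc-3` appear in both lists and so outrank the single-list hits. Rank-based fusion avoids having to put keyword scores and embedding similarities on the same scale.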
