A case study in PDF forensics: The Epstein PDFs

https://pdfa.org/a-case-study-in-pdf-forensics-the-epstein-pdfs/
68•DuffJohnson•1h ago

Comments

meidan_y•1h ago
(2025) just follow hn guideline, impressive voter ring though
alain94040•1h ago
We're in early February 2025 [edit:2026] and the article was written on Dec 23, 2025, which makes it less than two months old. I think it's ok not to include a year in the submission title in that case.

I personally understand a year in the submission as a warning that the article may not be up to date.

petepete•1h ago
We're in Feb 2026.

I'm not used to typing it yet, either.

michaelmcdonald•1h ago
"We're in early February ~2025~ *2026*"
GlitchRider47•1h ago
Generally, I'd agree with you. However, the recent Epstein file dump was in 2026, not 2025, so I would say it is relevant in this case.
embedding-shape•33m ago
It's less about the age and more about the risk of confusing what they are analyzing with the files that were released just a week or so ago.
tibbon•1h ago
That's a lot of PeDoFiles!

(But seriously, great work here!)

waynenilsen•50m ago
> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

hopefully someone is independently archiving all documents

my understanding is that some are being removed

embedding-shape•33m ago
Initially under "Epstein Files Transparency Act (H.R.4405)" on https://www.justice.gov/epstein/doj-disclosures, all datasets had .zip links. I first saw that page when all but dataset 11 (or 10) had a .zip link. At one point this morning, all the .zip links were removed; now it seems like most are back again.
some_random•13m ago
Are they being removed or replaced with more heavily redacted documents? There were definitely some victim names that slipped through the cracks that have since been redacted.
littlecorner•6m ago
I think some of the released documents included images of victims, which were redacted. So it's not necessarily malicious removals.
corygarms•39m ago
These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.
embedding-shape•35m ago
Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR that DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm currently sitting on ~500K images to OCR, so this is taking quite a while to run through.
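For anyone wanting to try a similar comparison, here is a minimal sketch of scoring two OCR transcripts of the same page against each other. It is not the commenter's actual pipeline: the function names, the whitespace/case normalization, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not counted."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ocr_similarity(text_a: str, text_b: str) -> float:
    """Return a similarity ratio in [0, 1] for two transcripts of the same page."""
    matcher = difflib.SequenceMatcher(
        None, normalize(text_a), normalize(text_b), autojunk=False
    )
    return matcher.ratio()


def flag_mismatches(pages, threshold=0.9):
    """Yield (page_id, score) for pages scoring below the (assumed) threshold.

    `pages` is an iterable of (page_id, transcript_a, transcript_b) tuples.
    """
    for page_id, text_a, text_b in pages:
        score = ocr_similarity(text_a, text_b)
        if score < threshold:
            yield page_id, score
```

`SequenceMatcher` is quadratic in the worst case, so on hundreds of thousands of pages it may make sense to compare per page (as here) rather than per document.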
originalvichy•27m ago
Did you take any steps to reduce the dimensions of the images, in case that improves performance? I have not tried this, as I have not performed an OCR task like this with an LLM. I would be interested to know at what size the VLM can no longer make out the details in text reliably.
embedding-shape•25m ago
The performance is OK; it takes a couple of seconds at most on my GPU. It's just the number of documents to get through that takes time, even with parallelism. The dimensions seem fine as they are, as far as I can tell.
nkozyra•30m ago
> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

Maybe I'm underestimating the full extent of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of and doing this for decades at this point, right?
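To illustrate how lightweight the metadata-stripping option can be, here is a rough sketch that walks the JPEG segment headers and drops the APP1 (EXIF/XMP), APP13 (IPTC) and COM segments while copying everything else verbatim. This is not a description of what DoJ does; the function name and the choice of markers to drop are assumptions, and other segments (for example APP2 ICC profiles) can also carry identifying data.

```python
import struct

# Marker bytes for segments that commonly carry metadata (an assumed, non-exhaustive set).
METADATA_MARKERS = {0xE1, 0xED, 0xFE}  # APP1 (EXIF/XMP), APP13 (IPTC), COM
SOS = 0xDA  # start of scan: entropy-coded image data follows


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG byte string without the metadata segments above."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == SOS:
            # Copy the scan header, image data and EOI unchanged.
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if length < 2 or end > len(data):
            raise ValueError("truncated or malformed segment")
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no SOS marker found")
```

Note that this only removes container-level metadata; it does nothing about information embedded in the pixel data itself.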

originalvichy•21m ago
Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper.

Another guess is that perhaps the step is part of a multi-step sanitization process, and the last step(s) perform the bitmap operation.

normalaccess•6m ago
I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage.
originalvichy•30m ago
Any guesses why some of the newest files seem to have random "=" characters in the text? My first thought was OCR, but it seemed to not be linked to characters like "E" that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a "=" character is found (although making this work for long search queries would make the search slower).
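As a minimal sketch of such a search tool, one option is to compile the query into a regular expression that tolerates stray "=" characters between any two letters. The function names here are made up for illustration, and the sketch only handles inserted "=" characters, not cases where a character was replaced or a line break was left in.

```python
import re


def tolerant_pattern(query: str) -> re.Pattern:
    """Compile a case-insensitive regex that allows '=' runs between query characters."""
    parts = [re.escape(ch) for ch in query]
    return re.compile("=*".join(parts), re.IGNORECASE)


def find_all(query: str, text: str):
    """Return (start, end) spans of every tolerant match of `query` in `text`."""
    return [m.span() for m in tolerant_pattern(query).finditer(text)]
```

For example, `find_all("meeting", "the mee=ting is set")` matches the span covering `mee=ting`. The pattern grows linearly with the query length, which is the slowdown for long queries mentioned above.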
torh•29m ago
Was on the frontpage yesterday: https://news.ycombinator.com/item?id=46868759
originalvichy•21m ago
Thanks a lot!
ripe•18m ago
The equal characters are due to poor handling of quoted-printable in email.

The author of gnus, Lars Ingebrigtsen, wrote a blog post explaining this. His post was on the HN front page today.
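For context, quoted-printable encodes non-ASCII bytes as `=XX` and uses a trailing `=` at the end of a line as a "soft" line break. A small sketch with Python's standard `quopri` module shows the difference between decoding properly and merely deleting the line endings. The `naive_unwrap` function is a hypothetical illustration of one way stray "=" characters could be left behind, not a claim about what the release tooling actually did.

```python
import quopri


def decode_qp(raw: bytes, charset: str = "utf-8") -> str:
    """Decode quoted-printable bytes, resolving soft line breaks and =XX escapes."""
    return quopri.decodestring(raw).decode(charset, errors="replace")


def naive_unwrap(raw: bytes) -> str:
    """Hypothetical faulty handling: join lines but leave the '=' markers in place."""
    joined = raw.replace(b"\r\n", b"").replace(b"\n", b"")
    return joined.decode("ascii", errors="replace")


sample = b"See you next Tues=\nday at the caf=C3=A9."
print(decode_qp(sample))     # soft break and escapes resolved
print(naive_unwrap(sample))  # a stray '=' remains inside "Tues=day"
```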

originalvichy•8m ago
He explained the newline thing that confused me. Good read!
bugeats•23m ago
Somebody ought to train an LLM exclusively on this text, just for funsies.
pc86•8m ago
DeepSeek-V4-JEE
