A case study in PDF forensics: The Epstein PDFs

https://pdfa.org/a-case-study-in-pdf-forensics-the-epstein-pdfs/
106•DuffJohnson•2h ago

Comments

meidan_y•2h ago
(2025) just follow hn guideline, impressive voter ring though
alain94040•2h ago
We're in early February 2025 [edit:2026] and the article was written on Dec 23, 2025, which makes it less than two months old. I think it's ok not to include a year in the submission title in that case.

I personally understand a year in the submission as a warning that the article may not be up to date.

petepete•2h ago
We're in Feb 2026.

I'm not used to typing it yet, either.

michaelmcdonald•2h ago
"We're in early February ~2025~ *2026*"
GlitchRider47•1h ago
Generally, I'd agree with you. However, the recent Epstein file dump was in 2026, not 2025, so I would say it is relevant in this case.
embedding-shape•1h ago
Less about the age, and more about confusing what they are analyzing, for the files that were just released like a week ago.
tibbon•1h ago
That's a lot of PeDoFiles!

(But seriously, great work here!)

ted_bunny•26m ago
Elite PDF File ring
waynenilsen•1h ago
> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

hopefully someone is independently archiving all documents

my understanding is that some are being removed

embedding-shape•1h ago
Initially under "Epstein Files Transparency Act (H.R.4405)" on https://www.justice.gov/epstein/doj-disclosures, all datasets had .zip links. I first saw that page when all but dataset 11 (or 10) had a .zip link. At one point this morning, all the .zip links were removed, now it seems like most are back again.
some_random•51m ago
Are they being removed or replaced with more heavily redacted documents? There were definitely some victim names that slipped through the cracks that have since been redacted.
littlecorner•44m ago
I think some of the released documents included images of victims, which were redacted. So it's not necessarily malicious removals.
dylan604•2m ago
That's my understanding too, so archiving the unredacted images could mean holding CSAM.
corygarms•1h ago
These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.
embedding-shape•1h ago
Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
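
For anyone wanting to try a similar comparison, here is a minimal sketch of scoring two transcriptions of the same page against each other. It is not the commenter's actual pipeline: the function names, the `(page_id, doj_text, model_text)` tuple shape, and the 0.9 threshold are all assumptions made for illustration, and it only uses Python's standard `difflib`.

```python
import difflib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not counted."""
    return " ".join(text.lower().split())


def ocr_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two transcriptions."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def flag_mismatches(pages, threshold=0.9):
    """Yield (page_id, score) for pages whose two transcriptions disagree.

    `pages` is a hypothetical iterable of (page_id, doj_text, model_text).
    """
    for page_id, doj_text, model_text in pages:
        score = ocr_similarity(doj_text, model_text)
        if score < threshold:
            yield page_id, score
```

A character-level ratio is slow on long pages, so at ~500K pages it would probably make sense to compare per line or per paragraph instead.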
originalvichy•1h ago
Did you take any steps to reduce the dimensions of the images, in case that improves performance? I have not tried this, as I have not performed an OCR task like this with an LLM. I would be interested to know at what size the VLM can no longer reliably make out the details in the text.
embedding-shape•1h ago
The performance is OK, it takes a couple of seconds per page at most on my GPU. It's just the number of documents to get through that takes time, even with parallelism. The dimensions seem fine as they are, as far as I can tell.
nkozyra•1h ago
> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

Maybe I'm underestimating the full extent of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower-DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?
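
To illustrate how small the metadata-stripping step can be, here is a rough sketch that walks the JPEG segment list and drops the segments that usually carry EXIF/XMP (APP1), IPTC (APP13), and comments (COM). It is a simplified example, not a vetted sanitizer: `strip_jpeg_metadata` is a made-up name, it ignores rarer cases such as fill bytes between segments, and it does nothing about information carried in the pixel data itself.

```python
import struct

# Markers dropped by this sketch: APP1 (EXIF/XMP), APP13 (IPTC), COM (comment).
STRIP_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG byte string without APP1, APP13 and COM segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything after this is compressed image data.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in STRIP_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no SOS marker found")
```

Rasterizing to a fresh bitmap, as the article describes, sidesteps having to enumerate every container that might hold metadata, which may be why it is preferred.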

originalvichy•1h ago
Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper.

Another guess is that perhaps the step is part of a multi-step sanitization process, and the last step(s) perform the bitmap operation.

normalaccess•44m ago
I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage.
originalvichy•1h ago
Any guesses why some of the newest files seem to have random "=" characters in the text? My first thought was OCR, but it did not seem to be linked to characters like "E" that an OCR tool could misread. My second guess is that it's just meant to make reliable text searches harder, but probably 90% of HN readers could find a way to make a search tool that does not fall apart when it hits a "=" character (although making this work for long search queries would make the search slower).
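
A minimal sketch of the kind of search tool described above: it builds a regular expression that lets stray "=" characters (optionally followed by whitespace or a line break) appear between any two characters of the query. The function name `tolerant_pattern` and the sample text are invented for illustration.

```python
import re


def tolerant_pattern(query: str) -> re.Pattern:
    """Compile a regex matching `query` even with stray '=' between characters."""
    # Zero or more '=' characters, each optionally followed by whitespace.
    gap = r"(?:=\s*)*"
    return re.compile(gap.join(re.escape(ch) for ch in query), re.IGNORECASE)


if __name__ == "__main__":
    text = "the fli=ght l=\nog was attached"
    print(bool(tolerant_pattern("flight log").search(text)))
```

The pattern grows with the query length, which is the slowdown mentioned above. Normalizing the text once before indexing is likely the cheaper option for a large corpus.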
torh•1h ago
Was on the frontpage yesterday: https://news.ycombinator.com/item?id=46868759
originalvichy•59m ago
Thanks a lot!
ripe•57m ago
The equal characters are due to poor handling of quoted-printable in email.

The author of gnus, Lars Ingebrigtsen, wrote a blog post explaining this. His post was on the HN front page today.
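
For reference, a small sketch of what correct quoted-printable decoding does, using Python's standard `quopri` module. In this encoding a "=" at the end of a line is a soft line break that the decoder removes, and "=3D" is an escaped literal "=". The sample bytes are made up, and this says nothing about which tool produced the released files.

```python
import quopri


def decode_qp(raw: bytes) -> str:
    """Decode quoted-printable bytes and return the result as UTF-8 text."""
    return quopri.decodestring(raw).decode("utf-8")


if __name__ == "__main__":
    # "=\n" is a soft line break; "=3D" is an escaped "=".
    print(decode_qp(b"sched=\nule a=3Db"))
```

If the soft line breaks are not handled, the "=" stays behind in the middle of a word, which matches the pattern people are seeing.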

originalvichy•46m ago
He explained the newline thing that confused me. Good read!
bugeats•1h ago
Somebody ought to train an LLM exclusively on this text, just for funsies.
pc86•46m ago
DeepSeek-V4-JEE
_def•37m ago
I can't even download the archive, the transmission always terminates just before it's finished. Spooky.
ted_bunny•25m ago
Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.
kmeisthax•18m ago
I'm pretty sure Epstein tried to meet with moot at least once: https://www.jmail.world/search?q=chris+poole
