X-ray: a Python library for finding bad redactions in PDF documents

https://github.com/freelawproject/x-ray
119•rendx•2h ago

Comments

seanw444•2h ago
The context for OP posting this is that many of the recently-released Epstein documents were PDFs "redacted" by being drawn on top of.
formerly_proven•2h ago
Is there a good free tool to properly redact PDFs? My workflow is to place black annotation rectangles on top and then print as PDF with "force rasterization" on. The resulting PDF files then just consist of pages with one image each. But this tends to be really suboptimal, because it's usually a grayscale or color rasterization, so file sizes are very large vs. monochrome PDFs with CCITT G3/G4 compression (which is absolutely what you want for text content, excellent compression and lossless). Post-processing PDFs to convert them to CCITT is rather annoying and I only know of CLI ways.
agumonkey•2h ago
I wasn't sure of this, even though sometimes you'd see remains of the original characters near the rectangles' edges. Does this mean the leaked documents have been de-redacted?
kstrauser•2h ago
At least some, yes: https://daringfireball.net/linked/2025/12/23/trump-doj-pdf-r...
agumonkey•1h ago
yeah i expected every political team, even the low level ones, to be fully aware of naive pdf "editing"... alas, incompetence often does that
arthurcolle•1h ago
Checks and balances for a more technological era.
airstrike•1h ago
Survival of the leetest
zahlman•1h ago
I'm actually surprised not to have yet heard widespread conspiracy theorization that this is deliberate for some inscrutable reason or other.
kstrauser•1h ago
Something something "chess, not checkers, this proves he has them on the run!"
k1t•2h ago
Yes, in some cases, eg. https://news.ycombinator.com/item?id=46364121
agumonkey•1h ago
oh that's a beautiful sight

hopefully this is the straw that breaks the camel's back

XorNot•1h ago
Why would that be the case? The government isn't redacting "yes we contacted aliens" they're redacting information about military capabilities that might be of use to adversaries.
agumonkey•1h ago
sorry the title mentioned epstein files, so i was hoping for incriminating facts that would accelerate trump's fall
jibal•1h ago
No reason to be sorry ... you are right and the other person seems quite confused about the context.
arthurcolle•1h ago
Also good for UFO/UAP/"anomalous phenomena" documents and remote viewing PDFs for what it's worth :)
IceHegel•2h ago
Given recent high profile redaction events, I think one simple use of AI would be to have it redact documents according to an objective standard.

That should in theory prevent overly redacted documents for political purposes.

An approach that could be rolled out today would be redacting with human review, but showing what % of redactions the AI would have done, and also showing the prompt given to the AI to perform redactions.

mmazing•1h ago
Honestly, it doesn't take any inference or need for AI, there's simply data in the documents that can be extracted.
bogtog•1h ago
I don't think the commenter above is saying that an AI should necessarily apply the redaction. Rather, an AI can serve as an objective-ish way of determining what should be redacted. This seems somewhat analogous to how (non-AI) models can be used to evaluate how gerrymandered a map is.
unfocused•1h ago
Adobe Pro, when used properly, will redact anything in a PDF permanently.

Whoever did these "bad" redactions doesn't even know how to use a PDF Editor.

We have paralegals and lawyers "mark for redaction", then review the documents, then "apply redactions". It's literally been done by thousands of lawyers/paralegals for decades. This is just someone not following the process and procedure, and making mistakes. It's actually quite amateurish. You should never, ever screw up redactions if you follow the proper process. Good on the X-ray project for trying to find errors.

I just want to add, applying black highlights on top of text is, in fact, the "old" way of redaction, as it was common to do this, and then simply print the paper with the black bars, and send the paper as the final product.

Whoever did it is probably old, and may have done it thinking they were going to print it on paper afterwards!! Just guessing as to why someone would do this.

tgsovlerkhgsel•1h ago
Or they may not understand how PDF works and think that it's the same as paper.

Especially with the "draw a black box over it" method, the text also stops being trivially mouse-selectable (even if CTRL+A might still work).

Another possibility is, of course, that whoever was responsible for this knew exactly what they were doing, but this way they can claim an honest mistake rather than intentionally leaking the data.

zahlman•1h ago
> Or they may not understand how PDF works and think that it's the same as paper.

Yes; that's presumably included in being "amateurish" and "not following proper process".

aidos•1h ago
A while back I did a little work with a company that were meant to help us improve our security posture. I terminated the contract after they sent me documents in which they’d redacted their own AWS keys using this method.
selectodude•5m ago
Any attorney or law enforcement that works for the US Federal Government receives very, very comprehensive instructions on how to redact information on basically the first day of training. There is absolutely zero doubt among any of my DOGE'd friends that this was 100 percent on purpose malicious compliance.
mlissner•1h ago
Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

The analysis was fun. We used S3 batch jobs, but we haven’t done the hard part of looking at the results and reporting them out. One day.

thangalin•28m ago
https://www.argeliuslabs.com/deep-research-on-pdf-redaction-...

> Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).

Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?

gigatexal•1h ago
Hilarious that DOJ didn’t flatten the layers so you can unredact stuff. What a clown show of incompetent idiots. Or… a skillful one over on the powers that be internally from someone who knew better but knew that they wouldn’t know … and did this to help us all
brotchie•1h ago
You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:
shbooms•48m ago
often times you will have requirements that the documents you release be digitally searchable and so in these cases, this would not be an option
8note•30m ago
run some ocr on them after to recreate the text layer?
selinkocalar•28m ago
As someone who's built an entire business on "anti-screenshots" this is brilliant.

PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
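
The failure described above (text still present underneath a drawn rectangle) can be checked geometrically. The following is a minimal sketch, not X-ray's actual implementation: it assumes word and rectangle bounding boxes have already been extracted from a page by some PDF library, and every name in it (`Box`, `overlap_fraction`, `find_covered_text`, the sample coordinates) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_fraction(word: Box, rect: Box) -> float:
    """Fraction of the word's box that lies under the rectangle."""
    w = min(word.x1, rect.x1) - max(word.x0, rect.x0)
    h = min(word.y1, rect.y1) - max(word.y0, rect.y0)
    if w <= 0 or h <= 0 or word.area() == 0:
        return 0.0
    return (w * h) / word.area()


def find_covered_text(words, rects, threshold=0.5):
    """Return the text of every word mostly hidden under some rectangle."""
    return [
        text
        for text, box in words
        if any(overlap_fraction(box, r) >= threshold for r in rects)
    ]


# Made-up page: three words on one line, one rectangle drawn over the middle one.
words = [
    ("Paid", Box(72, 700, 98, 712)),
    ("SECRET", Box(102, 700, 160, 712)),
    ("yesterday", Box(164, 700, 215, 712)),
]
rects = [Box(100, 698, 162, 714)]

leaked = find_covered_text(words, rects)
print(leaked)
```

If any word comes back from a check like this, the rectangle is only hiding the text visually; the characters are still in the file and can be extracted.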

embedding-shape•8m ago
Not to mention some PDF editors preserve previous edits in the PDF file itself, which people also seem unaware of. A bit more user friendly description of the feature without having to read the specification itself: https://developers.foxit.com/developer-hub/document/incremen...
embedding-shape•44m ago
I haven't gone through more than just 10% of the files released today, but noticed that at least EFTA00037069.pdf for example has a `/Prev` pointer, meaning the previous revision of the file is available inside of the PDF itself. In this case, the difference is minor, but I'm guessing if it's in one file, it could be more. You can run `qpdf --show-object=trailer EFTA00037069.pdf` on a PDF file to see for yourself if it's there.

I'm almost fully convinced that someone did this badly on purpose, together with the bad redactions, as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?
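
The `/Prev` check described above can also be approximated without qpdf. This is a rough heuristic sketch using only the Python standard library: it scans the raw bytes for more than one `%%EOF` marker or for a `/Prev` entry. The function names are hypothetical, and the sample byte strings are fake fragments rather than valid PDFs. Note that `/Prev` can also appear in linearized files, so a hit is a reason to look closer, not proof of a hidden earlier revision.

```python
import re


def incremental_update_hints(pdf_bytes: bytes) -> dict:
    """Count raw markers that suggest a PDF carries earlier revisions."""
    return {
        "eof_markers": len(re.findall(rb"%%EOF", pdf_bytes)),
        "prev_pointers": len(re.findall(rb"/Prev\s+\d+", pdf_bytes)),
    }


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """Heuristic only: True if the file shows signs of appended revisions."""
    hints = incremental_update_hints(pdf_bytes)
    return hints["eof_markers"] > 1 or hints["prev_pointers"] > 0


# Fake fragments for illustration; these are NOT valid PDF files.
original = (
    b"%PDF-1.7\n...body...\ntrailer\n<< /Size 10 /Root 1 0 R >>\n"
    b"startxref\n123\n%%EOF\n"
)
updated = original + (
    b"...appended objects...\ntrailer\n<< /Size 12 /Root 1 0 R /Prev 123 >>\n"
    b"startxref\n456\n%%EOF\n"
)

print(incremental_update_hints(original))
print(incremental_update_hints(updated))
```

For a real file, the bytes would come from something like `pathlib.Path("file.pdf").read_bytes()`. A proper parser such as qpdf remains the more reliable way to inspect the trailer.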

jmward01•23m ago
Hmmm.. The more I think about this the more any font kerning is likely a major leak for redaction. Even if the boxes have randomness applied to them, the words around a blacked out area have exact positioning that constrains the text within so that only certain letter/space combinations could fit between them. With a little knowledge of the rendering algorithm and some educated guessing about the text a brute force search may be able to do a very credible job of discovering the actual text. This isn't my field. Anyone out there that has actually worked on this problem?
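
The brute force idea above can be illustrated with a toy model. This sketch assumes a made-up table of per-character advance widths and ignores kerning pairs, ligatures and word spacing entirely, so it is far simpler than a real attack; `WIDTHS`, `text_width`, `candidates_for_gap` and the name list are all hypothetical.

```python
# Hypothetical advance widths (arbitrary units) for a made-up proportional font.
WIDTHS = {
    "i": 3, "l": 3, "t": 4, "s": 5,
    "a": 6, "e": 6, "n": 6, "o": 6,
    "w": 8, "m": 9,
}


def text_width(text, widths=WIDTHS, default=6):
    """Sum of per-character advance widths; unknown characters use a default."""
    return sum(widths.get(ch, default) for ch in text)


def candidates_for_gap(gap_width, names, tolerance=0.5):
    """Keep only the candidate strings whose rendered width fits the gap."""
    return [n for n in names if abs(text_width(n) - gap_width) <= tolerance]


# The gap is the distance between the words on either side of the black box.
names = ["tim", "sam", "will", "anne", "leo"]

print(candidates_for_gap(17, names))               # tight measurement
print(candidates_for_gap(16, names, tolerance=1))  # looser measurement
```

The tighter the measurement of the gap, the fewer candidates survive, which is why exact glyph positioning around a redaction narrows the search so much.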
dcollect•12m ago
lol thanks bros

text=about them to damage their credibility when they tried to go public with their stories of being
text=Epstein also threatened harm to victims and helped release damaging stories
=attorneys' fees and case costs in litigation related to this conduct.
=Defendants also attempted to conceal their criminal sex trafficking and abuse
text=$327,497.48 and $6,487.04 in New York City
text=trafficking and abuse conduct.
text=destroy evidence relevant to ongoing court proceedings involving Defendants' criminal sex
text=Epstein also instructed one or more Epstein Enterprise participant-witnesses to
text=trafficked and sexually abused.
text=conduct by paying large sums of money to participant-witnesses, including by paying for their
