Take potentially dangerous PDFs, and convert them to safe PDFs

https://github.com/freedomofpress/dangerzone
78•dp-hackernews•2h ago

Comments

dfajgljsldkjag•2h ago
I personally just upload them to google drive. It would be a serious pwn if they could somehow still do a compromise through google drive.
gleenn•1h ago
Do you have any specifics on what Drive does? Any examples of it fixing embedded viruses? Or is this a blind assumption?
akersten•1h ago
I assume they mean "upload to drive and use the web based reader to view the PDF," not "upload to drive and download it again"
gleenn•34m ago
And what special sauce does the web preview use? At some point, someone has to actually parse and process the data. I feel like on a tech site like Hacker News, speculating that Google has somehow done a perfect job of preventing malicious PDFs raises the question: how do you actually do that and prove that it's safe? And is that even possible in perpetuity?
bob1029•1h ago
Does google drive apply any transformation over the PDF, or are you effectively loading the same document in your browser on the round trip?
Gigachad•31m ago
They have some kind of virus scanner for files you open via a share link. Not sure about the ones you have stored on your own drive unshared.

But probably the main security here is just using the chrome pdf viewer instead of the adobe one. Which you can do without google drive. The browser PDF viewers ignore all the strange and risky parts of the PDF spec that would likely be exploited.

venusenvy47•16m ago
I often view PDFs in Drive, and it's definitely not just displaying the document with the native web browser. It is rendered with their "Drive renderer", whatever that is. They don't even display a simple .txt file natively in the browser.
snowmobile•1h ago
It's a neat program, but what's the use for JPGs and PNGs?
boston_clone•1h ago
There are some neat detection bypass / compromise methods using various image formats, including PNG [0] and SVG [1]!

I imagine that folks like journalists could have that type of attack in their threat model, and EFF already do a lot of great stuff in this space :)

0. https://isc.sans.edu/diary/31998

1. https://www.cloudflare.com/cloudforce-one/research/svgs-the-...

mike_d•1h ago
Shameless self promotion: preview.ninja is a site I built that does this and supports 300+ file formats. I'm currently weekend coding version 2.0 which will support 500+ formats and allow direct data extraction in addition to safe viewing.

It is a passion project and will always be free because commercial CDR[1] solutions are insanely expensive and everyone should have access to the tools to compute securely.

1. https://en.wikipedia.org/wiki/Content_Disarm_%26_Reconstruct...

coppsilgold•1h ago
While useful it needs a big red warning to potential leakers. If they were personally served documents (such as via email, while logged in, etc) there really isn't much that can be done to ascertain the safety of leaking it. It's not even safe if there are two or more leakers and they "compare notes" to try and "clean" something for release.

https://en.wikipedia.org/wiki/Traitor_tracing#Watermarking

https://arxiv.org/abs/1111.3597

The watermark can even be contained in the wording itself (multiple versions of sentences, word choice etc stores the entropy). The only moderately safe thing to leak would be a pure text full paraphrasing of the material. But that wouldn't inspire much trust as a source.
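To make the wording-based watermark concrete, here is a toy sketch in Python. Everything in it is hypothetical: the sentence pairs, the `embed` and `extract` names, and the idea that one bit per sentence selects the variant. Real schemes are far more robust, but the point is the same: the identifier lives in the text itself, so it survives any change of file format.

```python
# Hypothetical sentence variants: each slot offers two interchangeable wordings.
VARIANTS = [
    ("The meeting begins at noon.", "The meeting starts at noon."),
    ("Attendance is required.", "Attendance is mandatory."),
    ("Bring your laptop.", "Bring along your laptop."),
]


def embed(recipient_id: int) -> str:
    """Pick one wording per slot; bit i of recipient_id selects the variant."""
    return " ".join(pair[(recipient_id >> i) & 1] for i, pair in enumerate(VARIANTS))


def extract(text: str) -> int:
    """Recover the ID by checking which alternate wordings appear in the text."""
    recipient_id = 0
    for i, (_, alternate) in enumerate(VARIANTS):
        if alternate in text:
            recipient_id |= 1 << i
    return recipient_id


copy_for_recipient_5 = embed(5)
```

Three sentences distinguish only eight recipients, but the capacity grows by one bit with every additional sentence that has two acceptable wordings.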

alphazard•1h ago
I seem to remember Yahoo finance (I think it was them, maybe someone else) introducing benign errors into their market data feeds, to prevent scraping. This led to people doing 3 requests instead of just 1, to correct the errors, which was very expensive for them, so they turned it off.
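The three-request trick can be sketched as a per-position majority vote. This is a minimal illustration and not anything known about the actual feed: it assumes the injected errors are independent between requests and rarely land on the same value twice, and the prices and the `majority_merge` name are made up.

```python
from collections import Counter


def majority_merge(copies):
    """Per position, keep the value that most copies agree on."""
    if not copies or len({len(c) for c in copies}) != 1:
        raise ValueError("need at least one copy, all of equal length")
    return [Counter(values).most_common(1)[0][0] for values in zip(*copies)]


true_prices = [101.2, 99.8, 100.5, 102.1]
# Each fetch carries one deliberately wrong value at a different position.
fetch_a = [101.2, 99.9, 100.5, 102.1]
fetch_b = [101.2, 99.8, 100.4, 102.1]
fetch_c = [101.3, 99.8, 100.5, 102.1]

merged = majority_merge([fetch_a, fetch_b, fetch_c])
```

The vote only works because the errors are uncorrelated across copies; it says nothing about marks that were designed with colluders in mind.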

I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

coppsilgold•1h ago
> I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

This is a very common assumption that turns out to be false.

There are Tardos probabilistic codes (see the paper I linked) which have the watermark scale as the square of the traitor count.

For example, with a watermark of just 400 bits, 4 traitors (who try their best to corrupt the watermark) will stand out enough to merit investigation and with 800 bits be accused without any doubt. This is for a binary alphabet, with text you can generate a bigger alphabet and have shorter watermarks.
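For reference, the quadratic scaling is usually stated as a bound on the code length. The form below is the one from Tardos's original construction; the constant 100 comes from that construction, and later analyses, including the kind of decoder behind the figures quoted above, get by with much smaller constants, so treat it as illustrative only. Here $m$ is the watermark length in bits, $c$ the maximum number of colluders, and $\varepsilon_1$ the probability of accusing a given innocent recipient.

```latex
m = 100\, c^{2} \left\lceil \ln \frac{1}{\varepsilon_1} \right\rceil
```

At a fixed $\varepsilon_1$, doubling the number of colluders quadruples the required length, which is the "square of the traitor count" behaviour.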

These are typically intended for tracing pirated content, so they carry the so-called Marking Assumption (if given two or more versions of a piece of content, you must choose one; a pirate isn't going to corrupt or remove a piece of video, as that would make it unsuitable for leaking). So colluders could likely get better results with documents, and catching them reliably may require larger watermarks.

crazygringo•1h ago
This doesn't seem to be designed for leakers, i.e. people sending PDFs -- it's specifically for people receiving untrusted files, i.e. journalists.

And specifically about them not being hacked by malicious code. I'm not seeing anything that suggests it's about trying to remove traces of a file's origin.

I don't see why it would need a warning for something it's not designed for at all.

coppsilgold•1h ago
It would be natural for a leaker to assume that the PDF contains something "extra" and to try and remove it with this method. It may not occur to them that this something extra could be part of the content they are going to get back.
david_shaw•25m ago
From the tool description linked:

> Dangerzone works like this: You give it a document that you don't know if you can trust (for example, an email attachment). Inside of a sandbox, Dangerzone converts the document to a PDF (if it isn't already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, outside of the sandbox, Dangerzone takes this pixel data and converts it back into a PDF.

With this in mind, Dangerzone wouldn't even remove conventional watermarks (that inlay small amounts of text on the image).
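A toy model of that round trip shows why. This is not Dangerzone's code: the `Page` class, its `extras` field (a stand-in for scripts, metadata, and attachments), and both function names are hypothetical. The only thing that crosses the boundary is width, height, and RGB bytes, so anything visible survives and anything that is not a pixel is dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Page:
    width: int
    height: int
    rgb: bytes  # row-major, 3 bytes per pixel
    extras: Dict[str, str] = field(default_factory=dict)  # scripts, metadata, ...


def to_pixels(page: Page) -> Tuple[int, int, bytes]:
    """Sandbox side (toy): emit only dimensions and raw RGB values."""
    return page.width, page.height, bytes(page.rgb)


def from_pixels(width: int, height: int, rgb: bytes) -> Page:
    """Trusted side (toy): rebuild a page from pixel data alone."""
    if len(rgb) != width * height * 3:
        raise ValueError("pixel buffer does not match page dimensions")
    return Page(width, height, rgb)


# A 2x1 page whose second pixel is a visible grey "watermark" dot.
original = Page(
    2, 1, bytes([255, 255, 255, 200, 200, 200]),
    extras={"javascript": "app.alert(1)", "author": "alice"},
)
clean = from_pixels(*to_pixels(original))
```

After the round trip, `clean.extras` is empty while `clean.rgb` is byte-for-byte identical to the original, grey dot included.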

I think the "freedomofpress" GitHub repo primed you to think about protecting someone leaking to journalists, but really it's designed to keep journalists (and other security-minded folk) safe from untrusted attachments.

The official website -- https://dangerzone.rocks/ -- is a lot more clear about exactly what the tool does. It removes malware, removes network requests, supports various filetypes, and is open source.

Their about page ( https://dangerzone.rocks/about/ ) shows common use cases for journalists and others.

chaps•1h ago
Heh, I've seen this a bunch of times and it's of interest to me, but honestly? It's sooooo limiting by being an interface without a complementary command line tool. Like, I'd like to put this into some workflows but it doesn't really make sense to without using something like pyautogui. But maybe I'm missing something hidden in the documentation.
crazygringo•1h ago
It seems to be meant for end-users like journalists processing files individually like e-mail attachments.

It doesn't seem to be meant for usage at scale -- it's not for general-purpose conversion, as the resulting files are huge, will have OCR errors, etc.

chaps•1h ago
I'm the target audience for this sort of tool. :)
tclancy•27m ago
https://github.com/freedomofpress/dangerzone/blob/main/dange...

How hard did you look the other times?

chaps•17m ago
Not much further than their documentation, friend! But thanks for finding that, that's actually super helpful! I hope somebody puts in a pr for updating the documentation to make it clear what functionality their tool has.
jevinskie•47m ago
Seems like a similar but less elegant solution than parsing and normalizing to a “safe” subset instead of just blasting it to pixels.

https://github.com/caradoc-org/caradoc

http://spw16.langsec.org/slides/guillaume-endignoux-slides.p...
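As a rough idea of what "normalizing to a safe subset" could look like, here is a toy allowlist filter over a PDF-like object tree. It is not Caradoc's implementation, and the `SAFE_KEYS` set and `normalize` function are hypothetical; the key names themselves (`OpenAction`, `AA`, `JS`, `MediaBox`, and so on) are ordinary PDF dictionary keys. Anything not explicitly allowed is dropped, so the text and structure stay intact without rasterizing.

```python
# Hypothetical allowlist of dictionary keys treated as "safe" in this toy model.
SAFE_KEYS = {"Type", "Pages", "Kids", "Count", "MediaBox", "Contents", "Resources", "Font"}


def normalize(obj):
    """Keep only allowlisted dictionary keys, recursing into dicts and lists."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items() if k in SAFE_KEYS}
    if isinstance(obj, list):
        return [normalize(item) for item in obj]
    return obj  # scalars are returned unchanged


catalog = {
    "Type": "Catalog",
    "OpenAction": {"S": "JavaScript", "JS": "app.alert(1)"},  # runs on open
    "Pages": {
        "Type": "Pages",
        "Count": 1,
        "Kids": [{"Type": "Page", "MediaBox": [0, 0, 612, 792], "AA": {"O": "..."}}],
    },
}
safe = normalize(catalog)
```

The trade-off against the pixel approach is that the filter still has to parse the untrusted file correctly, which is exactly where parser bugs live.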

PaulDavisThe1st•44m ago
Is there some reason why just viewing the PDF with a FLOSS, limited PDF viewer (e.g. atril) would not accomplish the same level of safety? What can a "dangerous PDF" do inside atril?
philipkglass•33m ago
It looks like atril is mostly written in C:

https://github.com/mate-desktop/atril

A crafted PDF can potentially exploit a bug in atril to compromise the recipient's computer since writing memory-safe C is difficult. This approach was famously used by a malware vendor to exploit iMessage through a compressed image format that's part of the PDF standard:

https://projectzero.google/2021/12/a-deep-dive-into-nso-zero...

capitainenemo•23m ago
This is why Firefox chose to implement a custom PDF reader in pure JS for better sandboxing leveraging the existing browser JS sandboxing. As a side effect, it's been a helpful JS library for embedding PDFs on websites.

The Chrome PDF parser, originating from Foxit (now open-sourced as PDFium), has been the source of many exploits in Chrome itself over the years.

gu009•34m ago
A handy side use for this is compressing PDFs.

For some reason, printing 1 page of an Excel or Word document to a PDF often gets up to around 4MB in size. Passing it through this compresses it quite well.

Just ran a quick test:

- 1-page Excel PDF export: 3.7MB

- Processing with Dangerzone (OCR enabled): 131KB
