frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
426•klaussilveira•5h ago•97 comments

Hello world does not compile

https://github.com/anthropics/claudes-c-compiler/issues/1
21•mfiguiere•42m ago•8 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
775•xnx•11h ago•472 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
142•isitcontent•6h ago•15 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
135•dmpetrov•6h ago•57 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
41•quibono•4d ago•3 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
246•vecti•8h ago•117 comments

A century of hair samples proves leaded gas ban worked

https://arstechnica.com/science/2026/02/a-century-of-hair-samples-proves-leaded-gas-ban-worked/
70•jnord•3d ago•4 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
180•eljojo•8h ago•124 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
314•aktau•12h ago•154 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
12•matheusalmeida•1d ago•0 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
311•ostacke•12h ago•85 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
397•todsacerdoti•13h ago•217 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
322•lstoll•12h ago•233 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
12•kmm•4d ago•0 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
48•phreda4•5h ago•8 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
109•vmatsiiako•11h ago•34 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
186•i5heu•8h ago•129 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
236•surprisetalk•3d ago•31 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
976•cdrnsf•15h ago•415 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
144•limoce•3d ago•79 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
17•gfortaine•3h ago•2 comments

I'm going to cure my girlfriend's brain tumor

https://andrewjrod.substack.com/p/im-going-to-cure-my-girlfriends-brain
49•ray__•2h ago•11 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
41•rescrv•13h ago•17 comments

Evaluating and mitigating the growing risk of LLM-discovered 0-days

https://red.anthropic.com/2026/zero-days/
35•lebovic•1d ago•11 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
52•SerCe•2h ago•42 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
77•antves•1d ago•57 comments

The Oklahoma Architect Who Turned Kitsch into Art

https://www.bloomberg.com/news/features/2026-01-31/oklahoma-architect-bruce-goff-s-wild-home-desi...
18•MarlonPro•3d ago•4 comments

Claude Composer

https://www.josh.ing/blog/claude-composer
108•coloneltcb•2d ago•71 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
39•nwparker•1d ago•10 comments

ISO PDF spec is getting Brotli – ~20% smaller documents with no quality loss

https://pdfa.org/want-to-make-your-pdfs-20-smaller-for-free/
169•whizzx•2w ago

Comments

delfinom•2w ago
tl;dr A commercial entity is paying to have the ISO spec altered to "legalize" the SDK they are pushing, which is incompatible with standard PDF readers.

ISO is pay to play so :shrug:

bhouston•2w ago
I'm no fan of Adobe, but it is not that hard to add Brotli support given that it is open. It could probably be added by AI without much difficulty; it is a simple feature. Compared to the ton of other complex features PDF has, this is an easy one.
lmz•2w ago
It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.

https://pdfa.org/brotli-compression-coming-to-pdf/

> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.

> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.

adrian_b•2w ago
Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.

MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.

It is my default PDF and EPUB reader; only in the very rare cases when I encounter a PDF file that MuPDF cannot understand do I use other PDF readers (e.g. Okular).

whizzx•2w ago
No, this feature is coming straight from the PDF Association itself; we just added experimental support before it's officially in the spec to help testing between different SDK processors.

So your comment is a falsehood.

bhouston•2w ago
Are they using a custom Brotli dictionary designed for PDFs? I am not sure whether it would help or not, but it seems like one of those cases where it might.

Something like this:

https://developer.chrome.com/blog/shared-dictionary-compress...

In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.

whizzx•2w ago
The PDF Association is still running experiments on whether or not to support custom dictionaries, based on the gains seen on real-life workloads.

So it might land in the spec once it has proven it offers enough value.

Proclus•2w ago
It seems they're using the standard dictionary, which is utterly bizarre.

The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.

It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.

On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.

I too would strongly prefer that they use zstd.

bhouston•2w ago
BTW I've looked into custom dictionaries before for similar use cases, and I suspect it would only offer something like a 1% improvement for PDFs -- still good, but not a massive difference maker. The issue is that PDFs, like web pages, are incredibly repetitive in terms of their tags/structure. As such, a custom dictionary only helps if the doc is really small; otherwise, because of the repetitive nature, the self-inferred dictionary will resemble the custom dictionary after just a few blocks of PDF content.

The sole exception is if they are restarting the Brotli stream for each page and not sharing a dictionary, custom or inferred, across the whole doc. Then the dictionary has to be re-inferred on each page, and a shared custom dictionary would make more sense.
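This is also cheap to test empirically. A minimal sketch using zstd's dictionary trainer (corpus/*.pdf, pdf.dict and sample.pdf are placeholder names; brotli would need its own dictionary tooling):

    # train a dictionary on a corpus of (ideally uncompressed) PDFs
    zstd --train corpus/*.pdf -o pdf.dict
    # compare one file with and without the trained dictionary
    zstd -19 -D pdf.dict -c sample.pdf | wc -c
    zstd -19 -c sample.pdf | wc -c

If the reasoning above holds, the gap should only be noticeable for small files.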

bobpaw•2w ago
How can iText claim that adding Brotli is not a backward-incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created with this compression would be unintelligible to any reader that only supports Deflate.

Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.

whizzx•2w ago
It's prototype-ish work to support it before it lands in the official specification. But it will indeed take some adoption time.

I'm doing the work to patch in support across different viewers to help adoption grow. And once the big open-source ones (pdf.js, Poppler, PDFium) ship it, adoption can rise quickly.

croes•2w ago
There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF
ericpauley•2w ago
Some real cognitive dissonance in this article…

“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.

All this for brotli… on a read-many format like PDF, zstd's decompression speed is a much better fit.

xxs•2w ago
Yup, zstd is better. Overall, use zstd for pretty much anything that can benefit from general-purpose compression. It's a beyond-excellent library, tool, and set of algorithms.

Brotli w/o a custom dictionary is a weird choice to begin with.

greenavocado•2w ago
This bizarre move has all the hallmarks of embrace-extend-extinguish rather than technical excellence.
adzm•2w ago
Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.

That said, I personally prefer zstd as well; it's been a great general-use lib.

dist-epoch•2w ago
You need to crank up zstd's compression level.

zstd is Pareto-better than brotli: it compresses better and faster.

jeffbee•2w ago
Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.
xxs•2w ago
On max compression (11 vs. zstd's 22) of text, brotli will be around 3-4% denser... and a lot slower. Decompression-wise, zstd is over 2x faster.

The PDFs you have are already compressed with Deflate (zip).
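A crude way to check is to look for the declared stream filters; sample.pdf is a placeholder here, and this is only a heuristic since filter names can be written in other encodings:

    # count lines that declare the Deflate stream filter
    strings sample.pdf | grep -c FlateDecode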

atiedebee•2w ago
I thought the same, so I ran brotli and zstd on some PDFs I had laying around.

  brotli 1.0.7 args: -q 11 -w 24
  zstd v1.5.0  args: --ultra -22 --long=31 
                 | Original | zstd    | brotli
  RandomBook.pdf | 15M      | 4.6M    | 4.5M
  Invoice.pdf    | 19.3K    | 16.3K   | 16.1K
I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.

Brotli seemed to have a very slight edge over zstd, even on the larger PDF, which I did not expect.

order-matters•2w ago
What's the assumption we could point to as the reason for the counter-intuitive result?

That the data in PDF files is noisy, and zstd should perform better on noisy files?

jeffbee•2w ago
What's counter-intuitive about this outcome?
order-matters•2w ago
Maybe that was too strongly worded, but there was an expectation for zstd to outperform. So the fact that it didn't means the result was unexpected. I generally find it helpful to understand why something performs better than expected.
mort96•2w ago
Isn't zstd primarily designed to provide decent compression ratios at amazing speeds? The reason it's exciting is mainly that you can add compression to places where it didn't necessarily make sense before, because it's almost free in terms of CPU and memory consumption. I don't think it has ever had a stated goal of beating compression-ratio-focused algorithms like brotli on compression ratio.
sgerenser•2w ago
I actually thought zstd was supposed to be better than Brotli in most cases, but a bit of searching reveals you're right... Brotli, especially at the highest compression levels (10/11), often exceeds zstd at the highest compression levels (20-22). Both are very slow at those levels, although perfectly suitable for "compress once, decompress many" applications, of which the PDF spec is obviously one.
mort96•2w ago
EDIT: Something weird is going on here. When compressing with zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044

Results by compression type across 55 PDFs:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------+------+-----+------+--------+
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
mort96•2w ago
Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them, which reports the size on disk, and that size is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.

Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------+---------+--------+--------+--------+
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
tracker1•2w ago
Worth considering the compress/decompress overhead, which is also lower in brotli than zstd, from my understanding.

Also, worth testing zopfli since its decompression is gzip-compatible.

mrspuratic•2w ago
> I couldn't quickly find a way to decompress them

    pdftk in.pdf output out.pdf decompress
Thoreandan•2w ago
Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?
atiedebee•2w ago
I wasn't sure. I just went in with the (probably faulty) assumption that if it compressed to less than 90% of the original size, it had enough "non-randomness" to compare compression performance.
atiedebee•2w ago
Ran the tests again with some more files, this time decompressing the PDFs in advance. I picked some widely available PDFs to make the experiment reproducible.

  file            | raw         | zstd       (%)      | brotli     (%)     |
  gawk.pdf        | 8.068.092   | 1.437.529  (17.8%)  | 1.376.106  (17.1%) |
  shannon.pdf     | 335.009     | 68.739     (20.5%)  | 65.978     (19.6%) |
  attention.pdf   | 24.742.418  | 367.367    (1.4%)   | 362.578    (1.4%)  |
  learnopengl.pdf | 253.041.425 | 37.756.229 (14.9%)  | 35.223.532 (13.9%) |
For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':

  zstd:   0.4532 +- 0.0216 seconds time elapsed  ( +-  4.77% )
  brotli: 0.7641 +- 0.0242 seconds time elapsed  ( +-  3.17% )
The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, at the cost of decompressing at a little over half of zstd's speed.
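For reference, a measurement along these lines could be reproduced roughly as below; the file names and the flags from the table above are assumed:

    # compress the uncompressed PDF once with each codec
    zstd --ultra -22 --long=31 -c learnopengl.pdf > learnopengl.pdf.zst
    brotli -q 11 -w 24 -c learnopengl.pdf > learnopengl.pdf.br
    # time decompression over 5 runs each
    perf stat -r 5 zstd -d --long=31 -c learnopengl.pdf.zst > /dev/null
    perf stat -r 5 brotli -d -c learnopengl.pdf.br > /dev/null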
dchest•2w ago
Not with small files.
Dylan16807•2w ago
If that's about using predefined dictionaries, zstd can use them too.

If brotli has a different advantage on small source files, you have my curiosity.

If you're talking about max compression, zstd likely loses out there; the answer seems to vary based on the tests I look at, but it seems to be better across a very wide range.

dchest•1w ago
No, it's literally just compressing small files without training a zstd dictionary or plugging in external dictionaries (not counting the built-in one that brotli has). Especially for English text, brotli at the same speed as zstd gives better results for small data (in the kilobyte to a-few-megabytes range).
DetroitThrow•2w ago
I love zstd but this isn't necessarily true.
itsdesmond•2w ago
> Pareto

I don’t think you’re using that correctly.

wizzwizz4•2w ago
It's correct use of Pareto, short for Pareto frontier, if the claim being made is "for every needed compression ratio, zstd is faster; and for every needed time budget, zstd is faster". (Whether this claim is true is another matter.)
stonogo•2w ago
brotli is ubiquitous because Google recommends it. While Deflate definitely sucks and is old, Google ships brotli in Chrome, and since Chrome is the de facto default platform nowadays, I'd imagine it was chosen because it was the lowest-effort lift.

Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.

deepsun•2w ago
Brotli compresses my files way better, but it does so way slower. Anyway, the universal statement "zstd is better" is not valid.
xxs•2w ago
On max compression "--ultra -22", zstd is likely to be 2-4% less dense (larger) on text alike input. While taking over 2x times times to compress. Decompression is also much faster, usually over 2x.

I have not tried using a dictionary for zstd.

deepsun•2w ago
Well, except for speed, compression algorithms need to be compared in terms of compression, you know.

Here's discussion by brotli's and zstd's staff:

https://news.ycombinator.com/item?id=19678985

mmooss•2w ago
Note the language: "You're not creating broken files—you're creating files that are ahead of their time."

Imagine a sales meeting where someone pitched that to you. They have to be joking, right?

I have no objection to adding Brotli, but I hope they take compatibility more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.

nxobject•2w ago
(sarcasm warning...)

You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.

mmooss•2w ago
I don't understand your point ...
eventualcomp•2w ago
The commenter is making a joke about the style of delivery of the sentence you quoted, because the style is characteristic [1] of AI-generated writing.

[1]https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

spider-mario•2w ago
> on a read-many format like pdf zstd’s decompression speed is a much better fit.

brotli decompression is already plenty fast. For PDFs, zstd’s advantage in decompression speed is academic.

ksec•2w ago
Why not zstd?
PunchyHamster•2w ago
incompetence
whizzx•2w ago
You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/
jeffbee•2w ago
That mentions zstd in a weird incomplete sentence, but never compares it.
F3nd0•2w ago
They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?
eviks•2w ago
Hey, they did all the work and more, trust them!!!

> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.

LoganDark•2w ago
I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.
jsnell•2w ago
But they didn't write "all". They wrote "other", which absolutely does not imply full coverage.

Maybe read things a bit more carefully before going all out on the snide comments?

wizzwizz4•2w ago
In fact, they wrote "reviewing […] other due diligence tasks", which doesn't imply any coverage! This close, literal reading is an appropriate – nay, the only appropriate – way to draw conclusions about the degree of responsibility exhibited by the custodians of a living standard. By corollary, any criticism of this form could be rebuffed by appeal to a sufficiently-carefully-written press release.
LoganDark•2w ago
It implies potential coverage of anything one could bring up. It creates a similar impression in my mind, because it becomes easy to claim you already considered something.
HackerThemAll•2w ago
I think this was the main reason (from the linked article) LOL:

"Brotli is a compression algorithm developed by Google."

They have no idea about Zstandard, nor about ANS/FSE as compared with LZ77.

Sheer incompetence.

mort96•2w ago
EDIT: Something weird is going on here. When compressing with zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.

I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------+------+-----+------+--------+
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
Here's a table with all the files:

    +------+------+------+------+--------+
    | raw  | zstd | xz   | gzip | brotli |
    +------+------+------+------+--------+
    | 12K  | 12K  | 12K  | 12K  | 12K    |
    | 20K  | 20K  | 20K  | 20K  | 20K    | x5
    | 24K  | 20K  | 20K  | 20K  | 20K    | x5
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 32K  | 20K  | 20K  | 20K  | 20K    | x3
    | 32K  | 24K  | 24K  | 24K  | 24K    |
    | 40K  | 32K  | 32K  | 32K  | 32K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 48K  | 36K  | 36K  | 36K  | 36K    |
    | 48K  | 48K  | 48K  | 48K  | 48K    |
    | 76K  | 128K | 72K  | 72K  | 72K    |
    | 84K  | 140K | 84K  | 80K  | 80K    | x7
    | 88K  | 136K | 76K  | 76K  | 76K    |
    | 124K | 152K | 88K  | 92K  | 92K    |
    | 124K | 152K | 92K  | 96K  | 92K    |
    | 140K | 160K | 100K | 100K | 100K   |
    | 152K | 188K | 128K | 128K | 132K   |
    | 188K | 192K | 184K | 184K | 184K   |
    | 264K | 256K | 240K | 244K | 240K   |
    | 320K | 256K | 228K | 232K | 228K   |
    | 440K | 448K | 408K | 408K | 408K   |
    | 448K | 448K | 432K | 432K | 432K   |
    | 516K | 384K | 376K | 384K | 376K   |
    | 992K | 320K | 260K | 296K | 280K   |
    | 1.0M | 2.0M | 1.0M | 1.0M | 1.0M   |
    | 1.1M | 192K | 192K | 228K | 200K   |
    | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.2M | 1.1M | 1.0M | 1.0M | 1.0M   |
    | 1.3M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.7M | 2.0M | 1.7M | 1.7M | 1.7M   |
    | 1.9M | 960K | 896K | 952K | 916K   |
    | 2.9M | 2.0M | 1.3M | 1.4M | 1.4M   |
    | 3.2M | 4.0M | 3.1M | 3.1M | 3.0M   |
    | 3.7M | 4.0M | 3.5M | 3.5M | 3.5M   |
    | 6.4M | 4.0M | 4.1M | 3.7M | 3.5M   |
    | 6.4M | 6.0M | 6.1M | 5.8M | 5.7M   |
    | 9.7M | 10M  | 10M  | 9.5M | 9.4M   |
    +------+------+------+------+--------+
Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.

Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.

I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p

noname120•2w ago
Why not use a more widespread compression algorithm (e.g. gzip) considering that Brotli barely performs better at all? Sounds like a pain for portability
mort96•2w ago
I'm not sold on the idea of adding compression to PDF at all; I'm not convinced that the space savings are worth breaking compatibility with older readers. Especially when you consider that you can just compress it in transit with, e.g., HTTP's 'Content-Encoding' without any special PDF reader support. (You can even use 'Content-Encoding: br' for brotli!)

If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip, to be honest; both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.

But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)

ksec•2w ago
>But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)

I may dislike Google. But my support of JPEG XL and Zstd has nothing to do with the competing tech being Google's at all. I simply think JPEG XL and Zstd are better technologies.

noname120•2w ago
Could you add compression and decompression speeds to your table?
mort96•2w ago
I just did some interactive shell loops and globs to compress everything and output CSV which I processed into an ASCII table, so I don't exactly have a pipeline I can modify and re-run the tests with compression speeds added ... but I can run some more interactive shell-glob-and-loop-based analysis to give you decompression speeds:

    ~/tmp/pdfbench $ hyperfine --warmup 2 \
    'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
    'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
    'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
    'for x in br/*; do brotli -d >/dev/null <"$x"; done'
    Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.6 ms ±   1.3 ms    [User: 83.6 ms, System: 72.4 ms]
      Range (min … max):   162.0 ms … 166.9 ms    17 runs
    
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     143.0 ms ±   1.0 ms    [User: 87.6 ms, System: 43.6 ms]
      Range (min … max):   141.4 ms … 145.6 ms    20 runs
    
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     981.7 ms ±   1.6 ms    [User: 891.5 ms, System: 93.0 ms]
      Range (min … max):   978.7 ms … 984.3 ms    10 runs
    
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     254.5 ms ±   2.5 ms    [User: 172.9 ms, System: 67.4 ms]
      Range (min … max):   252.3 ms … 260.5 ms    11 runs
    
    Summary
      for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
        1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
        1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
        6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done
As expected, xz is super slow. Gzip is fastest, zstd being somewhat slower, brotli slower again but still much faster than xz.

    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 143ms | 165ms | 255ms  | 982ms |
    +-------+-------+--------+-------+
I honestly expected zstd to win here.
terrelln•2w ago
Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.
mort96•2w ago
It seems like zstd is somehow compressing really badly when many zstd processes are run in parallel, but works as expected when run sequentially: https://news.ycombinator.com/item?id=46723158

Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:

    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 142ms | 165ms | 269ms  | 994ms |
    +-------+-------+--------+-------+
Raw hyperfine output:

    ~/tmp/pdfbench $ du -h zst2 gz xz br
     37M    zst2
     38M    gz
     38M    xz
     37M    br
    
    ~/tmp/pdfbench $ hyperfine ...
    Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.5 ms ±   2.3 ms    [User: 83.5 ms, System: 72.3 ms]
      Range (min … max):   162.3 ms … 172.3 ms    17 runs
    
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     142.2 ms ±   0.9 ms    [User: 87.4 ms, System: 43.1 ms]
      Range (min … max):   140.8 ms … 143.9 ms    20 runs
    
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     993.9 ms ±   9.2 ms    [User: 896.7 ms, System: 99.1 ms]
      Range (min … max):   981.4 ms … 1007.2 ms    10 runs
    
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     269.1 ms ±   8.8 ms    [User: 176.6 ms, System: 75.8 ms]
      Range (min … max):   261.8 ms … 287.6 ms    10 runs
terrelln•2w ago
Ah, I understand. In this benchmark, Zstd's decompression speed is 284 MB/s, and Gzip's is 330 MB/s. This benchmark is likely dominated by file I/O for the faster decompressors.

On the incompressible files, I'd expect decompression with any algorithm to approach the speed of `memcpy()`, and I'd generally expect zstd's decompression speed to be faster. For example, on an x86 core running at 2GHz, Zstd decompresses a file at 660 MB/s, and on my M1 at 1276 MB/s.

You could measure locally either using a specialized tool like lzbench [0], or for zstd by just running `zstd -b22 --ultra /path/to/file`, which will print the compression ratio, compression speed, and decompression speed.

[0] https://github.com/inikep/lzbench
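For example, a whole range of levels can be swept in one invocation; sample.pdf is a placeholder:

    # benchmark compression levels 1 through 22 on one file;
    # prints the ratio plus compression and decompression speed per level
    zstd -b1 -e22 --ultra sample.pdf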

noname120•2w ago
Thanks a lot. Interestingly Brotli’s author mentioned here that zstd is 2× faster at decompressing, which roughly matches your numbers:

https://news.ycombinator.com/item?id=46035817

I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?

terrelln•2w ago
> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |

Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?

If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.

mort96•2w ago
Okay now this is weird.

I can reproduce it just fine ... but only when compressing all PDFs simultaneously.

To utilize all cores, I ran:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
(and similar for the other formats).

I ran this again and it produced the same 2M file from the source 1.1M file. However, when I run without parallelization:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).

What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?

Zekio•2w ago
doesn't zstd cap out at compression level 19?
mort96•2w ago
From the man page:

    --ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.
Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495
terrelln•2w ago
Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.

So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.
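That property is easy to sanity-check with incompressible input (rand.bin is a throwaway name; the comment states the expected outcome rather than a measurement):

    # 1 MiB of incompressible data: zstd should fall back to raw blocks,
    # adding only a few bytes of per-block and frame overhead
    head -c 1048576 /dev/urandom > rand.bin
    zstd -19 -c rand.bin | wc -c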

I will try to reproduce, but I suspect that there is something unrelated to zstd going on.

What version of zstd are you using?

mort96•2w ago
'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495

I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738

terrelln•2w ago
I've figured out the issue. Use `wc -c` instead of `du`.

I can repro on my Mac with these steps with either `zstd` or `gzip`:

    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have run zstd's benchmark twice, and every other compressor's benchmark once.

I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)

mort96•2w ago
Interesting!

It doesn't seem to be only about overwriting; I can be in a directory without any .zst files and run the command to compress 55 files in parallel, and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.

My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...
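For the record, on macOS the discrepancy is easy to see side by side; file.zst is a placeholder:

    du -h file.zst        # allocated size on disk (the misleading number here)
    du -Ah file.zst       # apparent size, rounded for humans
    wc -c < file.zst      # exact byte count
    stat -f %z file.zst   # exact byte count via stat (though without du's directory summing)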

terrelln•2w ago
Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.

It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.

mort96•2w ago
Ross Burton on Mastodon suggests that it might be deduplication; when writing sequentially, later files can re-use blocks from earlier files, while that isn't the case as much when writing in parallel. That seems plausible enough to me.
mort96•2w ago
I've concluded that this can't be the reason. It'd only result in an error where the size reported by 'du' is smaller than the apparent size (aka the number of bytes reported by 'wc -c') of the file. What we see here is that the size reported by 'du' is almost twice as large as the number of bytes. That can't be the result of deduplication.

I'll chalk it up to "some APFS weirdness".

gcr•2w ago
If you're worried about double-compression of image data, you can uncompress all images by using qpdf:

    qpdf --stream-data=uncompress in.pdf out.pdf
The resulting file should compress better with zstd.
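E.g. a quick before/after comparison, with in.pdf and out.pdf as placeholder names:

    # expand all streams, then see how each general-purpose codec does on the result
    qpdf --stream-data=uncompress in.pdf out.pdf
    zstd --ultra -22 -c out.pdf | wc -c
    brotli -q 11 -c out.pdf | wc -c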
mort96•2w ago
Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them, which reports the size on disk, and that size is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.

Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------+---------+--------+--------+--------+
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
cortesoft•2w ago
I can’t imagine the people actually doing the technical work don’t know about Zstandard.
cess11•2w ago
'Your PDFs will open slower because we decided that the CDN providers are more important than you.'

If size were important to users, then it wouldn't be so common for systems providers to crap out huge PDF files consisting mainly of layout-junk 'sophistication' with rounded borders and whatnot.

The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.

noname120•2w ago
Ridiculous statement. CDN providers can already use filesystem compression and standard HTTP Accept-Encoding compression for transfers (which includes brotli, by the way). This ISO change provides virtually no benefit to them.
cess11•2w ago
This reasoning comes from TFA.
h4x0rr•2w ago
Wouldn't LZMA2 be better here, since a PDF is more read-heavy?
F3nd0•2w ago
Going by a comment [1] from one of Brotli's authors on another post, it probably wouldn't.

[1] https://news.ycombinator.com/item?id=46035817

nialse•2w ago
Who is responsible for the terrible decision? In the pro-vs-con analysis, occasionally saving 20% in size vs. updating ALL PDF libraries/apps/viewers ever built SHOULD be a no-brainer.
ndriscoll•2w ago
What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd-compressed, and HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with brotli vs. zstd; let the person who has to live with the tradeoff decide it, not the original file author).
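For instance, transport compression is already negotiated per request; a sketch with example.com as a placeholder, assuming the server has the corresponding encoders enabled:

    # advertise what the client can decode and see which codec the server picked
    curl -sI -H 'Accept-Encoding: br, zstd, gzip' https://example.com/report.pdf | grep -i '^content-encoding'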
eru•2w ago
Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or your favourite compression format), instead of ending up with PDF.

Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.

dunham•2w ago
Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.

I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.

I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.

mikkupikku•2w ago
I thought PDFs can contain arbitrary PS.
eru•2w ago
> Don't you end up with PDF if you start with PS and restrict it to a subset?

PDF is also a binary format.

lmz•2w ago
Compression filters are in PostScript.
Someone•2w ago
- inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

- when jumping from page to page, you won’t have to decompress the entire file

wizzwizz4•2w ago
> inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

> when jumping from page to page, you won’t have to decompress the entire file

This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.

Someone•2w ago
> Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

Far from the same amount:

- existing tools that split PDFs into pages will remain working

- if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.

wongarsu•2w ago
Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.
avalys•2w ago
This article is AI slop.
jeffbee•2w ago
Yep.
superkuh•2w ago
This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made PDF an unreliable format. The aforementioned is a PDF without content that pulls down all its content from the web. It even still, ostensibly, supports Flash (SWF) animations. In practice these "PDF"s are just empty white pages with an error message like:

>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."

kayodelycaon•2w ago
Fortunately, XFA is deprecated. I haven’t seen one of those for a very long time.
superkuh•2w ago
Maybe in the spec, but the damage is done and persists.

The (USA) Wisconsin Dept. of Natural Resources has nearly all of their regulation PDFs as these XFA non-PDFs that I cannot read. So I cannot know the regulations. My emails about this topic (to multiple addresses, a dozen times over many years) have gone unanswered.

If Acrobat supports it, it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.

whinvik•2w ago
I am often frustrated by PDF issues, such as how complicated it is to create one.

But reading the article, I realized PDF has become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slowly.

jhealy•2w ago
The article is wrong; the PDF spec has introduced breaking changes plenty of times. It's done slowly and conservatively though, particularly now that the format is an ISO spec.

The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It's quite probable that a v1.7-compliant PDF won't open in a reader app written when v1.3 was the latest standard.

nbevans•2w ago
This is a really, really bad idea. Don't break backwards compat for a 20% gain. Internet connection speeds and storage capacities only go up. In a few years' time, a 20% gain will seem like a crazy thing to have broken back-compat for.
gcr•2w ago
If we're making breaking changes to PDFs, I'd love it if the committee added a modern image format like JPEG-XL. In my experience, most disk usage of PDFs comes from images, not content streams.

I keep a bunch of comics in PDF, but JPEG-XL is by far the best way to enjoy them in terms of disk space.

Bolwin•2w ago
Odd you should say that, as that's exactly what they've been discussing
gcr•2w ago
No it's not. This article is about proposing Brotli as another possible '/Filter' for stream objects, like content streams (page drawing commands). Images are streams too, but unless you mean compressing raw pixel bytes in Brotli, there's no mention of a JPEG-XL or WEBP filter.
NoahZuniga•2w ago
Well, it's not mentioned in this specific article, but JPEG-XL support is something they're working on [1].

[1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...

gcr•2w ago
Oh cool!! TIL