> Available-Dictionary: : =:
It seems very odd to use a colon as the starting and ending delimiter when the header name already uses a colon. Wouldn't a comma or semicolon work better?

There's almost no added complexity here, since zstd already handles separate compression dictionaries quite well.
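For anyone who hasn't used that feature, here's a minimal sketch of what a separate dictionary looks like with the Python `zstandard` bindings (the dictionary bytes and payload are invented placeholders):

```python
# pip install zstandard
import zstandard

# Any shared bytes can act as the dictionary; here it's an invented HTML-ish blob.
dictionary_bytes = b"<html><head><script src=\"/static/app.js\"></script></head>" * 50
payload = b"<html><head><script src=\"/static/app.js\"></script></head><body>hi</body></html>"

zdict = zstandard.ZstdCompressionDict(dictionary_bytes)

# The dictionary is supplied out of band on both sides; it is not embedded
# in the compressed stream itself.
compressed = zstandard.ZstdCompressor(dict_data=zdict).compress(payload)
restored = zstandard.ZstdDecompressor(dict_data=zdict).decompress(compressed)
assert restored == payload
```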
Brotli has a default dictionary with bits of HTML and scripts. This is built into the decompressor and not sent with the files.
The decompression dictionaries aren't magic. They're basically a prefix for decompressed files, so that a first occurrence of some pattern can be referenced from the dictionary instead of built from scratch. This helps only with the first occurrences of data near the start of the file, and for all the later repetitions the dictionary becomes irrelevant.
The dictionary needs to be downloaded too, and you're not going to have dictionaries all the way down, so you pay the cost of decompressing the data without a dictionary whether it's a dictionary + dictionary-using-file, or just the full file itself.
Which is why the idea is to use a previous version of the same file, which you already have cached from a prior visit to the site. You pay the cost of decompressing without a dictionary, but only on the first visit. Basically it's a way to restore the benefits of caching for files that change often, but only a little bit each time.
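A toy version of that idea, assuming the Python `zstandard` bindings and random bytes as a stand-in for a cached bundle:

```python
import os
import zstandard

v1 = os.urandom(50_000)                                     # stand-in for the cached old version
v2 = v1 + b"// small change shipped in the new release\n"   # new version = old + a small diff

plain = zstandard.ZstdCompressor().compress(v2)             # random data barely compresses

zdict = zstandard.ZstdCompressionDict(v1)                   # old version used as a raw dictionary
delta = zstandard.ZstdCompressor(dict_data=zdict).compress(v2)

print(len(plain), len(delta))                               # the delta should be a tiny fraction

# Decoding only works if the client still has v1 (the dictionary) cached.
assert zstandard.ZstdDecompressor(dict_data=zdict).decompress(delta) == v2
```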
A dictionary created from the actual content being compressed will end up looking very different.
We already have a way to manage this: standardize and version dictionaries for various media types (with a checksum), then just cache them locally forever, since they should be immutable by design.
To prevent an overgrowth of dictionaries with small differences, we could require each one to be an RFC.
The examples show a significant gain from using a dictionary compared to compressing without one.
It seems like instead of sites reducing bloat, they will just shift the bloat to your hard drive. Some of the examples mentioned a dictionary of 1 MB, which doesn't seem big, but it could add up if everyone does this.
For example, take the CNN example:
> The JavaScript was 98% smaller using the previous version as a dictionary for the new version than if the new version was downloaded with brotli alone. Specifically, the 278kb JavaScript was 90kb with brotli alone and 2kb when using brotli and the previous version as a dictionary.
Oh wow! 98% savings! That's amazing! Except in absolute terms the difference between 90 KB and 2 KB is only 88 KB. Meanwhile, cnn.com pulls in 63.7 MB of data just on the first page load. So in reality, that 88 KB saved was less than 0.14% of the total data, which is negligible.
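The arithmetic, for anyone who wants to check it:

```python
saved_kb = 90 - 2                  # brotli alone vs. brotli + previous version as dictionary
print(saved_kb / 90)               # ≈0.98, the quoted "98% smaller"
print(saved_kb / (63.7 * 1024))    # ≈0.0013, i.e. under 0.14% of the 63.7 MB first load
```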
A hosting platform or CDN could analyze the most common responses of a website on its platform, build an efficient dictionary from that data, and then automatically inject a link to that site-specific dictionary so future responses are optimally compressed, saving bandwidth. All transparent to the customers and end users.
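A rough sketch of what that could look like, assuming the Python `zstandard` bindings and its dictionary trainer (the sample responses are invented, and real training is pickier about sample count and size):

```python
import zstandard

# Pretend these are the platform's most common response bodies.
samples = [
    b'{"status":"ok","user":{"id":%d,"name":"user%d","plan":"pro"},"items":[]}' % (i, i)
    for i in range(1000)
]

# Train a small site-specific dictionary from them...
zdict = zstandard.train_dictionary(4096, samples)

# ...and reuse it for every future response of the same shape.
cctx = zstandard.ZstdCompressor(dict_data=zdict)
body = b'{"status":"ok","user":{"id":1234,"name":"user1234","plan":"pro"},"items":[]}'
print(len(body), "->", len(cctx.compress(body)), "bytes with the site dictionary")
```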
However, I'm sceptical about the usefulness of multi-page shared dictionaries (where you construct one for a whole site or a group of pages). They're a gamble that can backfire.
The extra dictionary needs to be downloaded, so it starts as an extra overhead. It's not enough for it to just match something. It has to beat regular (per-page) compression to be better than nothing, and it must be useful enough to repay its own cost before it even starts being a net positive. This basically means everything in the dictionary must be useful to a user, and has to be used more than once, otherwise it's just an unnecessary upfront slowdown.
Standard (per-page) compression is already very good at removing simple repetitive patterns, and Brotli even comes with a default built-in dictionary of generic HTML-like fragments. This further narrows the usefulness of shared dictionaries, because generic page-like content is no longer enough to give them an advantage. They need to contain more specific content to beat standard compression, but the more specific the dictionary is, the smaller the chance of it fitting what the user actually browses.
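To put the "repay its own cost" point above in concrete terms, some back-of-the-envelope numbers (all invented):

```python
dict_download_kb = 60        # one-time cost of fetching the shared dictionary
plain_kb_per_page = 30       # a page compressed with ordinary per-page compression
dict_kb_per_page = 22        # the same page compressed against the dictionary

saving_per_page = plain_kb_per_page - dict_kb_per_page
print(f"break-even after {dict_download_kb / saving_per_page:.1f} page views")  # 7.5 here
```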
Previously servers would cache compressed versions of your static resources.
Whereas now they either have to compress on-the-fly or have a massive cache of not only your most recent static JavaScript blob, but also all past blobs and versions compressed using different combinations of them as a dictionary.
This could easily 10x the resources needed for serving static HTML/CSS/JS.
Then the server is doing more work at request time, but it's not meaningfully more work: just checking whether the request path has a dictionary-compressed form that matches the dictionary hash provided by the client.
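Roughly this kind of lookup, with a hypothetical cache layout and deliberately simplistic parsing of the Available-Dictionary value (a structured-field byte sequence carrying the SHA-256 of the dictionary):

```python
import base64
import hashlib

# Hypothetical cache index: (request path, SHA-256 of the dictionary the
# client still has) -> precompressed artifact on disk.
old_bundle = b"console.log('v1');"                 # previously shipped file, i.e. the dictionary
old_hash = hashlib.sha256(old_bundle).digest()

precompressed = {
    ("/static/app.js", old_hash): "/cache/static/app.js.delta.dcz",
}

def pick_variant(path: str, available_dictionary: str | None) -> str:
    """Decide which cached file to serve for this request."""
    if available_dictionary:
        # Header value looks like ":<base64 hash>:"
        digest = base64.b64decode(available_dictionary.strip().strip(":"))
        hit = precompressed.get((path, digest))
        if hit:
            return hit                             # serve the dictionary-compressed variant
    return f"/cache{path}.br"                      # plain brotli fallback

header = ":" + base64.b64encode(old_hash).decode() + ":"
print(pick_variant("/static/app.js", header))      # /cache/static/app.js.delta.dcz
print(pick_variant("/static/app.js", None))        # /cache/static/app.js.br
```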
Thinking about it a bit more: we are doing this at the character level (a Unicode table), so why can't we look up words or maybe even common sentences?
There's every possible text in Pi, but on average it's going to cost at least as much to encode the location of the text as the text itself.
To get compression, you can only shift costs around, by making some things take fewer bits to represent, at the cost of making everything else take more bits to disambiguate (e.g. instead of all bytes taking 8 bits, you can make a specific byte take 1 bit, but all other bytes will need 9 bits).
To be able to reference words from an English dictionary, you will have to dedicate some sequences of bits to them in the compressed stream.
If you use your best and shortest sequences, you're wasting them on picking from an inflexible fixed dictionary, instead of representing data in some more sophisticated way that is more frequently useful (which decoders already do by building adaptive dictionaries on the fly and other dynamic techniques).
If you try to avoid hurting normal compression and assign less valuable longer sequences of bits to the dictionary words instead, these sequences will likely end up being longer than the words themselves.
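Some idealized arithmetic behind that (real coders differ in the details, but the trade-off is the same):

```python
import math

# Cost of pointing into a fixed list of ~100k English words:
words = 100_000
bits = math.log2(words)
print(f"{bits:.1f} bits ≈ {bits / 8:.1f} bytes per word reference")   # ~16.6 bits ≈ 2.1 bytes
# ...and that's before spending anything to signal "this token is a word
# reference", which is exactly the cost-shifting problem described above.

# The byte example from earlier: give one byte value a 1-bit code and only half
# of the code space remains for the other 255 values, so they need 9 bits each.
print(1 + math.ceil(math.log2(255)))   # 9
```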
Allowing the dictionary to change the decoded message obviously means that things like malware injection become a possibility.