Text classification with Python 3.14's ZSTD module

https://maxhalford.github.io/blog/text-classification-zstd/
53•alexmolas•2d ago

Comments

Scene_Cast2•1h ago
Sweet! I love clever information theory things like that.

It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...

EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.
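
The classical-compressor half of that comparison really is a few lines; the LLM side would be its average cross-entropy on the same bytes. A sketch with made-up sample text:

    import zlib

    # Made-up sample text; swap in real data (and, on Python 3.14,
    # the stdlib zstd module) for the actual comparison.
    text = b"the quick brown fox jumps over the lazy dog. " * 200
    bpb = 8 * len(zlib.compress(text, 9)) / len(text)
    print(f"zlib: {bpb:.2f} bits/byte")  # compare to the LLM's cross-entropy in bits/byte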

dchest•1h ago
https://bellard.org/ts_zip/
fwip•1h ago
Aren't LLMs lossy? You could make them lossless by also encoding a diff of the predicted output vs the actual text.

Edit to soften a claim I didn't mean to make.

thomasmg•50m ago
LLMs are good at predicting the next token. Basically, you use the model to estimate the probabilities of the next token being a, b, or c, and then use arithmetic coding to record which one actually occurred. The LLM runs during both compression and decompression, so both sides see the same probabilities.
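
A minimal sketch of the idea: an arithmetic coder spends about -log2(p) bits per symbol, so the total compressed size is the model's cross-entropy on the text. The bigram table below is a toy stand-in for the LLM (a real setup would query the model for next-token probabilities at every step):

    import math

    # Toy stand-in for an LLM's next-token distribution (made-up numbers).
    BIGRAM = {
        ("the", "cat"): 0.4, ("the", "dog"): 0.4, ("the", "zstd"): 0.2,
        ("cat", "sat"): 0.9, ("cat", "ran"): 0.1,
    }

    def ideal_compressed_bits(tokens):
        # An arithmetic coder costs about -log2(p) bits per symbol, so the
        # total is the model's cross-entropy on the text: better prediction
        # means fewer bits.
        bits = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = BIGRAM.get((prev, cur), 1e-6)  # tiny floor for unseen pairs
            bits += -math.log2(p)
        return bits

    print(ideal_compressed_bits(["the", "cat", "sat"]))  # ~1.47 bits
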
az09mugen•51m ago
I do not agree with the "lossless" adjective. And even if it is lossless, it's certainly not deterministic.

For example, I would not want a zip of an encyclopedia that decompresses to unverified, approximate, and sometimes outright wrong text. According to this site, https://www.wikiwand.com/en/articles/Size%20of%20Wikipedia, a compressed Wikipedia without media, just the text, is ~24 GB. What's the typical size of an LLM: 10 GB? 50 GB? 100 GB? Even if it's smaller, it's not an accurate and deterministic way to compress text.

Yeah, pretty easy to calculate...

throwaway81523•1h ago
Python's zlib does support preset-dictionary compression via the zdict parameter on its incremental compressobj/decompressobj API. gzip has something similar, but you have to do something hacky to get at it, since the regular Python API doesn't expose the entry point. I did manage to use it from Python a while back, though my memory is hazy on how; the entry point may have been exposed in the module's code but left undocumented in the Python manual.
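
For reference, a quick sketch of the zdict route (the dictionary string here is made up; real preset dictionaries are built from text representative of the payloads):

    import zlib

    # Made-up preset dictionary sharing vocabulary with the payloads.
    ZDICT = b"text classification compression python zstd module"

    def compress(data: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=ZDICT)
        return c.compress(data) + c.flush()

    def decompress(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=ZDICT)  # same dictionary on both ends
        return d.decompress(blob) + d.flush()

    msg = b"text classification with python's zstd module"
    assert decompress(compress(msg)) == msg
    print(len(compress(msg)), "vs", len(zlib.compress(msg, 9)))
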
ks2048•47m ago
This looks like a nice rundown of how to do this with Python's zstd module.

But I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are closely related, but practical compressors and practical classifiers have different goals and constraints).

Back in 2023, I wrote two blog posts [0, 1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data); a rough sketch of that paper's approach follows the links below.

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/
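
For context, that paper's method is roughly: compute a normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), between the test document and every training document, then take a k-nearest-neighbor vote. A toy sketch with gzip (texts and labels here are invented):

    import gzip

    def C(s: str) -> int:
        # Compressed length as a rough complexity measure.
        return len(gzip.compress(s.encode()))

    def ncd(x: str, y: str) -> float:
        # Normalized compression distance between two strings.
        cx, cy = C(x), C(y)
        return (C(x + y) - min(cx, cy)) / max(cx, cy)

    # Invented toy training set; the real paper used standard benchmarks.
    train = [
        ("the cat sat on the mat", "animals"),
        ("stocks fell sharply in early trading", "finance"),
        ("the dog chased the cat around the yard", "animals"),
        ("the central bank raised interest rates", "finance"),
    ]

    def classify(text: str, k: int = 3) -> str:
        # Majority vote among the k nearest training texts by NCD.
        nearest = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
        labels = [label for _, label in nearest]
        return max(set(labels), key=labels.count)

    print(classify("a cat and a dog"))  # likely "animals", given shared vocabulary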

shoo•30m ago
Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.

> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results

OutOfHere•24m ago
Can `copy.deepcopy` not help?
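
Assuming the question is about reusing compressor state (the speedup in those posts amounts to compressing the shared training text once and forking the state per test item): zlib compress objects expose a .copy() method for exactly this, which sidesteps whether deepcopy works on them. A sketch:

    import zlib

    # Compress the shared training text once...
    base = zlib.compressobj(level=9)
    base.compress(b"shared training text used in every comparison " * 200)

    # ...then fork the compressor state per test sample instead of
    # recompressing the shared prefix each time.
    for sample in [b"first test sample", b"second test sample"]:
        c = base.copy()  # snapshot of the internal compressor state
        print(sample, len(c.compress(sample) + c.flush()))
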
chaps•15m ago
Ooh, totally. Many years ago I was doing some analysis of parking-ticket data using gnuplot and had it output a chart PNG per street. Not great, but it worked well for the next step of that project: sorting the directory by file size. The most dynamic streets were by far the largest files.

Another way I've used image compression: identifying cops who cover their body cameras while recording -- the file-size-to-duration ratio shows that not much activity is going on.