Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
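To make the "predictions into an arithmetic coder" step concrete, here is a minimal sketch, not the code from the repo. It assumes a trivial adaptive byte-count model in place of the transformer, and it uses exact fractions instead of the fixed-precision integer arithmetic a real coder would use, so it is only practical for short inputs. The names `AdaptiveByteModel`, `encode`, and `decode` are hypothetical.

```python
import math
from fractions import Fraction


class AdaptiveByteModel:
    """Stand-in predictor: adaptive byte counts (NOT a transformer)."""

    def __init__(self):
        self.counts = [1] * 256  # start at 1 so no byte has zero probability
        self.total = 256

    def interval(self, symbol):
        """Return (count of all smaller bytes, count of this byte, total)."""
        return sum(self.counts[:symbol]), self.counts[symbol], self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data: bytes):
    """Narrow [low, low + width) once per byte, then emit a number inside it."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for b in data:
        cum, cnt, total = model.interval(b)
        low += width * Fraction(cum, total)
        width *= Fraction(cnt, total)
        model.update(b)
    # Smallest nbits with 2^-nbits <= width, so a multiple of 2^-nbits fits.
    nbits = 0
    while Fraction(1, 2 ** nbits) > width:
        nbits += 1
    code = math.ceil(low * 2 ** nbits)
    return code, nbits


def decode(code: int, nbits: int, length: int) -> bytes:
    """Replay the same model and pick the byte whose sub-interval holds the code."""
    model = AdaptiveByteModel()
    value = Fraction(code, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        target = (value - low) / width * model.total  # position in [0, total)
        cum = 0
        for symbol, cnt in enumerate(model.counts):
            if cum + cnt > target:
                break
            cum += cnt
        low += width * Fraction(cum, model.total)
        width *= Fraction(cnt, model.total)
        out.append(symbol)
        model.update(symbol)
    return bytes(out)


if __name__ == "__main__":
    sample = b"abracadabra " * 20
    code, nbits = encode(sample)
    assert decode(code, nbits, len(sample)) == sample
    print(f"{nbits / len(sample):.2f} bits/byte")
```

The better the model's probability for the byte that actually occurs, the less the interval shrinks and the fewer bits are needed. The decoder has to run the exact same model, which is presumably why decompression costs about as much as compression.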
On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
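For reference, bits/byte is just 8 times the compressed size divided by the original size. A quick check of the figures above, assuming 1 MB = 10^6 bytes and treating the rounded sizes as exact (they are "about" values, so the results are approximate too):

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of output bits spent per input byte."""
    return 8 * compressed_bytes / original_bytes


MB = 10 ** 6  # assumption: decimal megabytes

taxi = bits_per_byte(7 * MB, 100 * MB)     # NYC taxi CSV
enwik = bits_per_byte(21 * MB, 100 * MB)   # 100MB slice of enwik9
print(f"taxi: {taxi:.2f} bits/byte, enwik9 slice: {enwik:.2f} bits/byte")
```

With these rounded inputs the taxi CSV works out to 0.56 bits/byte, in line with the ~0.5 quoted, and the enwik9 slice to 1.68.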
It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).
Check out the repo - https://github.com/samyak112/pym-particles
7373737373•2d ago
spidy__•2d ago
I know the top submission was able to get it down to 13 MB.
Still trying some ideas to get better compression.
gravypod•1h ago
purple-leafy•7h ago
cellular•1h ago
Edit: oh wait, that's too easy. Need to generate/publish random digits so everyone can use it.
saulpw•1h ago
SV_BubbleTime•1h ago
Data being random does not mean it cannot match a pattern in your dictionary, for example.
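A rough way to see how much such chance matches help in practice: a small experiment using zlib as a stand-in dictionary-style compressor (an assumption; it is not the approach from the post), run on seeded pseudo-random bytes and on an obviously repetitive input of the same size.

```python
import random
import zlib

N = 100_000
rng = random.Random(0)  # fixed seed so the run is repeatable

random_bytes = bytes(rng.getrandbits(8) for _ in range(N))
repetitive = b"0123456789" * (N // 10)


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"random:     {ratio(random_bytes):.4f}")
print(f"repetitive: {ratio(repetitive):.4f}")
```

Short matches do turn up by chance in the random input, but encoding a match reference costs about as much as the bytes it replaces, so the output should not end up meaningfully smaller than the input.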
gnabgib•1h ago
[0]: https://en.wikipedia.org/wiki/Randomness
[1]: https://en.wikipedia.org/wiki/Data_compression
thin_carapace•53m ago
ufocia•