frontpage.
newsnewestaskshowjobs

Open Source @Github

fp.

Open in hackernews

Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

97•spidy__•3d ago
I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Checkout the repo - https://github.com/samyak112/pym-particles

Comments

7373737373•3d ago
What does it compress the full 1GB file to? http://prize.hutter1.net/
spidy__•3d ago
I tried it on a enwik9 100 mb slice and was able to compress it to 20 mb + 900kb transformer so 21mb.

I know the top submission was able to get it to 13 mb.

Still trying some ideas to get better compression.

gravypod•19h ago
Since you know the size of the file beforehand you may be able to overfit some kind of text diffusion model instead of a transformer? May allow you to partially correct the model output using some other method and then fill in the blanks that were wrong from previous generations.
spidy__•17h ago
Oh, sounds interesting. I hadn't considered using a diffusion model for this. My current approach generates the document byte by byte with an autoregressive transformer, so I'm curious how a diffusion model would improve memorization or reconstruction quality.

Can you point me to something that i can read? I really wanna try this approach , diffusion model does sounds interesting for compression.

atiedebee•11h ago
Which slice? The large text compression benchmark uses enwik8 for a "smaller" input that is easily reproducible. The predictability of enwik9 can vary significantly depending on where in the file you are, as shown by Matt Mahoney https://www.mattmahoney.net/dc/textdata.html
purple-leafy•1d ago
Thanks for the link!
cellular•19h ago
Maybe everyone should compress the 1st 100MB worth of digits of pi, for an apples-to-apples comparison?

Edit: oh wait that's too easy. Need to generate /publish random digits so everyone can use it.

saulpw•19h ago
random digits aren't compressible though?
SV_BubbleTime•19h ago
Random digits are compressible though.

Random data does not mean it does not match a pattern in your dictionary for example.

gnabgib•19h ago
No.. they're not. Do you understand random (the apparent or actual lack of definite patterns or predictability[0]) or compression (reduces bits by identifying and eliminating statistical redundancy[1])?

[0]: https://en.wikipedia.org/wiki/Randomness

[1]: https://en.wikipedia.org/wiki/Data_compression

thin_carapace•19h ago
by this definition, a random dataset could apparently present no patterns, while presenting non apparent patterns.
ufocia•
purple-leafy•2d ago
That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat that prize link
spidy__•1d ago
Reallly?? So have you published something so far? Can i read something? Sounds like you got some interesting ideas.
purple-leafy•1d ago
I will be showcasing something on hackernews soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB

But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things

purple-leafy•1d ago
And just for comparison, my absolute best compression method managed to get down to 10s of KB, but the real unlock got to the ~1KB figures. Note these numbers are ALL post-compression numbers. This is not raw data vs compressed data. The ~100KB figure IS POST COMPRESSION.

For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously

So if you do the math, my compression feat is effectively ~99% compression on naive best case. And if you compare it to the raw data, it’s closing in on an even higher number than that I haven’t done the math but the raw data is another factor of 10 greater than ~100KB so the “compression” versus raw data is ~99.9%

It sounds absolutely bullshit I know :D

But I will be posting a blog post soon once I release the game.

I do compression in quotes because it’s not a pure compression feat, the 99%+ feat is effectively being clever about what actually requires compression to achieve the same outcome

tae0086•1d ago
Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.
spidy__•1d ago
For the model overhead to become significant enough to eat into the gains, the file size would need to be fairly small, right? I assumed nobody would use this for compressing anything below 100 MB.

I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).

I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.

userbinator•20h ago
Fabrice Bellard may have been the first to do this, 7 years ago: https://news.ycombinator.com/item?id=27244004
spidy__•17h ago
Yeah yeah, I just found the idea kinda interesting so wanted to implement it
touisteur•16h ago
Keep exploring and writing (please ?) ! Love seeing people explore ... even after Fabrice Bellard had a go at it.
spidy__•15h ago
Ofcourse, am doing this just because I enjoy it.

While we are on the discussion I have mentioned a question at the end of the discussion around an assumption am trying, can you please check it out and see if you have any suggestions?

Would be awesome if someone can validate or help.

pentaphobe•13h ago
Hope this isnt too spicy a take, but i find it a bit disingenuous to use language that implies invention (and with no mention or citation of previous work), only to switch to dismissive language when someone notes a predecessor who you've apparently already heard of
SubiculumCode•19h ago
What do those compress to with conventional approaches? For comparison.

I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.

I'm sure someone's tried to overfit a bunch of transformers for compression like this, then bag them to see how well it does?

gwern•18h ago
Ensembling is not compute or parameter-efficient, so compression per se is a terrible application. (This is related to why people train ever larger LLMs like 1 10t-parameter LLM, rather than 100 GPT-3-scale LLMs.)
SubiculumCode•16h ago
Yeah.
fsiefken•14h ago
conventional algorithms https://www.mattmahoney.net/dc/text.html
wildstrawberry•19h ago
Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?

spidy__•17h ago
1. I wrote the content as what i want to mention in the documentation and just used AI to polish it so that its easy to understand, is it hard to understand the documentation right now?

2. Have added the link for downloading both the enwik9 slice and the nyc dataset. Apologies I forgot to add it.

You can get it from here - https://github.com/samyak112/pym-particles/blob/main/README....

3. Other than zip i tested it with zstd19, and now that you mentioned LZMA2 and BZIP2

I got results on enwik9 100mb slice as

zstd - 28mb bzip2 - 30mb lzma2 - 26mb

I will mention these and results from ZPAQ in the readme for both files, thanks for pointing them out!!!

But the thing is this neural compression approach cant be used right now, as it takes hours to compress and de compress a 100mb file so not really usable and more of a fun project.

IncreasePosts•16h ago
These algorithms let you specify a compression level - please note in the docs which you used. The window size can also be adjusted. Zstd might default to 4, which is "goodish compression but fast"
spidy__
rtpg•19h ago
I've had this idea of building a codec that would similarly overfit to specific images. But the codec itself would not be a fixed size transformer... instead you could just mess around with the sizing to get better quality/smaller size.

So the codec would be something like: <header describing image size + transformer layer shape> <transformer data itself>

I've seen experiments where people have a "fixed" pipeline but I think having something more dynamic would work quite well.

dvt•17h ago
Likely doable with metaparameter tuning (used to work on a team with data scientists that were routinely doing this in various situations). Seems like a cool idea.
IncreasePosts•18h ago
Isnt this what auto encoders are for?
jxmorris12•18h ago
Lo and behold, a nice arithmetic coding implementation that wasn't written by an LLM! A sight for sore eyes – a treat, even. Looks like it was written by someone else though.

Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...

spidy__•17h ago
Ohh yeah , I took it from Project Nayuki as mentioned in the file as well, i tried to pip install it but there were some issues so just took the file and kept the copy right as it is.

Its not an issue is it? I am not sure.

jmspring•18h ago
The model is the important part, a huffman code or adaptive huffman or other sorts of encoders would be much better on a dataset based on the model. You need the model to also decode. And on a dataset of sufficient size, embedding the model and the benefit of it's memorization of the file can be offset.

A non-general compression algorithm (model - I don't mean a distinct llm, but "modeling data") targeted at a specific dataset will always do better than a general algorithm.

The reason I mentioned the "encoder" doesn't matter - arithmetic coding, for the data it is presented, will beat huffman/adaptive huffman every day, but it's the model that is where the real "compression" comes into play.

I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).

purple-leafy•18h ago
Dumb question: can you train a model to predict the next byte of ANOTHER MODEL

So apply this same logic to compressing a bigger model within a smaller model

I know this is absolutely regarded, but humour me please

anyg•18h ago
Not dumb at all. It's a whole field of active research - Speculative Decoding. A recent paper goes one level deeper with Speculative Speculative Decoding - https://arxiv.org/abs/2603.03251
purple-leafy•18h ago
Oh man awesome! I’m so S-M-R-T

Compression is such an interesting field

userbinator•17h ago
If there's any redundancy in the model that can be compressed (parallel to how RLE is used to compress the static Huffman tree in FLATE) that's possible, but it's not necessary if the model is being trained on the input dynamically, like what Bellard's NNCP does.
test1072•18h ago
So has anyone tried to you know for example keep constant weights base model and just transmit the data, might be better compression
spidy__•17h ago
I might be confused by the question, but I overfit the model on a single file and then transport the model along with the arithmetic coding file. There have been ideas where you generalize a model (constant weights) and then pass the arithmetic coding file along with it. So that way you only pass the arithmetic coding file.

BUT my model size is just 900KB (for 100mb file atleast) so it is negligible

totetsu•18h ago
if first bit is 1 then decompress to a picture of my cat, else its just Huffman
whacked_new•17h ago
Somewhat related is stavros's method to compress 500KB to something like 50 bytes https://www.stavros.io/posts/compressing-images-with-stable-...

main drawback is that it's not lossless ;-)

but this is great. I hope this actually becomes a format that wraps the weights and transformer module (maybe this can also be NAS-optimized too?). Maybe it would even work for video?

It's like calling gzip but instead of compression level you choose kolmogorov complexity level

isoprophlex•17h ago
> There are some minor kinks that need to be worked out, such as the fact that each image takes around a day to generate on mobile, but this is more than acceptable in certain domains. Website visitors, for example, are well-accustomed to such loading times, and would barely notice any difference.

Just amazing, wow

userbinator•16h ago
Maybe it would even work for video?

While clearly satirical, it's definitely quite thought-provoking from various angles including the basis of information, representation of data, and even copyright. It's like watching a movie, writing a book based on it, and then making another movie based on that book.

VorticonCmdr•15h ago
Great work. Just Yesterday I thought about LLMzip and asked myself if this is something which could vastly improve HTML compression when done at Google scale and shipped with browsers. I haven't done any research though.
spidy__•15h ago
I mean neural compressors provide great compression, BUTT the issue is they are really slow like in my project it takes around 45 minutes for de compression of 100 mb so I doubt if it would be useful, also using a transformer in user's browser sounds like a heavy task.
18h ago
Sounds like presenting no patterns, apparently or otherwise, would be a pattern in itself.
saulpw•7h ago
That's like a teenage "i am very smart" thinking. I mean sure we can look at some string of random bits and say "that looks random" but you can't just generate any old string of random bits to replace it (which would be the only 'pattern' that could be leveraged for compression here). If it's encrypted it'll also appear random, and therefore not be compressible, but you have to encode every byte exactly or the message won't be decryptable.
IncreasePosts•16h ago
Over infinite runs, you can't compress random data, but that doesn't mean any finite string of random digits is incompressible
gcr•10h ago
I could write a program to generate the first 100MB of pi in a couple kilobytes. That certainly counts as “data compression” but isn’t useful outside this particular problem instance.
branc116•18h ago
Compressor: Output an empty file.

Decompressor: Take any old algorithm for finding digets of pi, find first 100M of them, print them.

Compression ratio of 0! :0

spidy__•21h ago
Sounds interesting man, soo am a bit confused maybe but can you run this on enwik9?
purple-leafy•19h ago
Probably not lol, it’s very specific to PvP multiplayer games, tested on my own game. But maybe I can extract the core concept to enwiki9 but I doubt it
andai•19h ago
I was working on a multiplayer game a while ago, and one of the iterations of the netcode was "thin client" where clients just sent input, server simulated the game, and it dumped world state onto the pipe at 60hz. I didn't ship that version but I estimated a $3000 bandwidth bill with that approach!

I started looking into diffing the state, compression, etc... until I realized, wait a minute! My player movement is linear so I only need a packet for start and stop! And so I achieved near infinite efficiency improvement :)

I think the word is... a specialized solution can beat a general one.

Also, "remembering what the program actually needs to do, and just making it do that"... I de-pessimized the netcode: https://youtube.com/watch?v=pgoetgxecw8

purple-leafy•18h ago
$3000 bill wow!!

Clever insight :) yes a specialised solution usually wins! Good effort

Did you end up publishing your game?

andai•1h ago
My game has been in development hell for several years. I've worked out most of the single-node issues and now I'm working out how to scale it to multiple nodes. (It is embarrassingly parallel, but I am embarrassingly stupid!)

Meanwhile I'm learning how to ship a trivial single player game -- I went with Pong, and it turns out polishing Pong and getting it ready for mass consumption is a surprisingly labor-intensive endeavour (yes, even with the fabled miracle machine assistants!)

But, it should probably be online by the end of July. Like, for real this time :)

What are you working on?

•
15h ago
I tried with zstd 19
wildstrawberry•8h ago
Appreciate the followup!!!

There were a few tells of AI based on my use of AI for personal projects, especially the end section where it says "what I tried that didn't work", I've seen Claude put sections like that in the documentation.

Big thanks for linking the datasets. Here's my results from ZPAQ (max effort):

./zpaq.exe add archive.zpaq nyc_taxi_dataset_100mb_slice.txt -m5

Time: 218 seconds

Final size: 9.57MB

./zpaq.exe add archive.zpaq enwik_9_slice_100mb.txt -m5

Time: 199 seconds

Final size: 20.46MB

So, your approach is comparable to ZPAQ for the wiki dataset and achieves a better compression ratio than ZPAQ with the taxi dataset. Cool!

Bit of a tangent but if you're curious there's an interesting writeup from a few years back that compares lossless text compression algorithms at various effort levels (speed vs compression ratio). I read it recently which prompted all of my questions

https://giannirosato.com/blog/post/lossless-data-comp/

spidy__•8h ago
Ohh that "what I tried that didn't work" was a section that I specifically wrote myself (and then polished with AI) because I wanted to document what are the different approaches I tried to compress more but failed.

Also thanks for the reference looks like a interesting read.

Show HN: Smart model routing directly in Claude, Codex and Cursor

https://github.com/workweave/router
131•adchurch•6h ago•83 comments

Show HN: Autofit2 – End-to-end pipeline for multilingual text classification

https://github.com/neospe/autofit2
11•leschak•1d ago•0 comments

Show HN: WebBase-III – dBASE III rebuilt in the browser with its own interpreter

https://github.com/DDecoene/WebBaseIII
76•ddecoene•2d ago•26 comments

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion

https://github.com/inkeep/open-knowledge
360•engomez•1d ago•168 comments

Show HN: I Derived a Steak

https://www.absurdlyoptimized.com/recipes/grilled-meats/
4•bkazez•2h ago•0 comments

Show HN: Mantis, A self-hosted LLM gateway

https://github.com/mantis-llm-gateway
5•rizsyed1•4h ago•0 comments

Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

97•spidy__•3d ago•60 comments

Show HN: Chess-Inspired Roguelike

https://princechazz.com
423•cowboy_henk•5d ago•146 comments

Show HN: I made Google Trends for Hacker News by indexing 18 years of comments

https://hackernewstrends.com
761•ytkimirti•1d ago•151 comments

Show HN: Closing the public-key authenticity gap in our E2EE social network

https://mosslet.com/blog/articles/19
2•mosspigletdev•5h ago•0 comments

Show HN: AgentBrush – Your coding agent's missing tool: image generation

https://agentbrush.dev/
4•Yan4300•5h ago•0 comments

Show HN: Turn native language audio into flashcards and shadowing practice

https://lingochunk.com/try
87•alder•1d ago•36 comments

Show HN: I built a hardware quantum RNG and wired it into a Magic 8-Ball

https://dnhkng.github.io/posts/building-the-beam-universe-splitter/
10•dnhkng•5h ago•0 comments

Show HN: MiniPCs.zip – Charting the Pareto frontier of Mini PCs

https://minipcs.zip
113•yathern•6d ago•46 comments

Show HN: A map of every UK railway, including stations that no longer exist

https://trainmap.co.uk/map.html
2•optionalltd•5h ago•0 comments

Show HN: TBD, a Mac-native CLI-forward coding agent multiplexer

https://github.com/cheapsteak/tbd
4•cheapsteak•6h ago•0 comments

Show HN: Puzzle with Strangers. A free multiplayer jigsaw

https://endtime-instruments.org/puzzle/
4•janoelze•6h ago•4 comments

Show HN: Brain Frog – Can you be random enough for 11 lines of JavaScript?

https://brainfrog.lol
48•AlexanderZ•1w ago•44 comments

Show HN: Bible as RAG Database

https://www.crosscanon.com/
152•jacksonastone•1d ago•90 comments

Show HN: Jargo – a Golang port of Pipecat for conversational-AI apps

https://github.com/gojargo/jargo
5•fallais•10h ago•0 comments

Show HN: `uvx ptn` and expose any system to agents (dangerously)

https://pypi.org/project/ptn/
3•yxl448•7h ago•0 comments

Show HN: StartupsBR – A map of Brazilian startups

https://www.startupsbr.com/sao-paulo
54•leonagano•1w ago•26 comments

Show HN: ZeroGate – API gateway to scale cloud GPUs to zero when idle

https://github.com/noah-garner/zerogate
5•ngarner•9h ago•0 comments

Show HN: Monolisa v3 – a typeface for developers and creatives

https://www.monolisa.dev/
189•bebraw•4d ago•92 comments

Show HN: I built a tool that matches Founders to VCs based on their pitch deck

https://investormatch.pro
3•MartinTobias•10h ago•0 comments

Show HN: Ponder – the best articles and essays on the internet

https://www.readponder.com/
3•wingdiction•10h ago•2 comments

Show HN: Motif Atlas – recurring patterns behind complex systems

https://nikitph.github.io/motifs/
8•loaderchips•15h ago•2 comments

Show HN: Nimic – Pure Python as a systems language with AOT compilation

https://github.com/dima-quant/nimic
42•dima-quant•3d ago•27 comments

Show HN: Wordit – Change One Letter, Keep the Chain Going

https://victorribeiro.com/wordit/
43•atum47•3d ago•28 comments

Show HN: Nub – A Bun-like all-in-one toolkit for Node.js

https://github.com/nubjs/nub
271•colinmcd•2d ago•80 comments