"Of course, in the short term, there’s a whole host of caveats: you need an LLM, likely a GPU, all your data is in the context window (which we know scales poorly), and this only works on text data."
Okay, that's not fair. There's a big advantage to having an external compressor and reference file whose bytes aren't counted, whether or not your compressor models knowledge.
More importantly, even with that advantage it only wins on the much smaller enwik8. It loses pretty badly on enwik9.
There is no unfair advantage here. This was also achieved in the 2019-2021 period; it seems safe to say Bellard could have pushed the frontier much further with modern compute and techniques.
The benchmark in question (the Hutter Prize) does count the size of the decompressor/reference file (per the rules, the compressor is supposed to produce a self-decompressing file).
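To make the accounting concrete, here is a minimal sketch (Python, with made-up numbers, not actual benchmark results) of the difference between a metric that counts only the compressed payload and a Hutter Prize-style metric, which also charges you for the self-extracting decompressor and any model weights or reference files it embeds:

```python
# Hypothetical sizes in bytes -- illustrative only, not real results.
ENWIK9_SIZE = 1_000_000_000        # uncompressed input size

payload_size = 110_000_000         # compressed output of an LLM-based compressor (made up)
decompressor_size = 170_000_000    # self-extracting decompressor incl. model weights (made up)

# Naive metric: only the compressed payload counts.
naive_ratio = payload_size / ENWIK9_SIZE

# Hutter Prize-style metric: the archive must self-decompress,
# so the decompressor (and any reference/model file) counts too.
total_size = payload_size + decompressor_size
hutter_ratio = total_size / ENWIK9_SIZE

print(f"payload only:           {naive_ratio:.3f} of original")
print(f"payload + decompressor: {hutter_ratio:.3f} of original")
```

Under the first metric a large neural model looks great; under the second, the model's own size can erase the gain, which is the point of contention above.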
The article mentions Bellard's work, but I don't see his name among the prize's top contenders, so I'm guessing his attempt wasn't competitive enough once you account for the LLM size, as the rules require.