We were the first to introduce post-rotation distribution-aware quantization in 2021, which was *later* adopted in many fields, including federated learning, vector retrieval, databases, and KV-cache compression.
It would be nice to get some credit for this. And it is certainly baffling to see the name "TurboQuant" repeated in this context, considering the many works from 2021 onwards.
This blog post essentially walks you through EDEN quantization, but then ends by presenting a suboptimal MSE-minimizing variant and an unbiasing trick that is a full bit worse in many cases.
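For readers unfamiliar with the trade-off being discussed, here is a minimal sketch of the two roundings, rotate-then-quantize style. The dense QR rotation and the uniform grid are simplifications for illustration (EDEN and the schemes above use structured rotations and distribution-aware level placement); the function names are mine, not from any of the papers:

```python
import numpy as np

def random_rotation(d, seed=0):
    # Orthogonal matrix via QR of a Gaussian. Real schemes use structured
    # (e.g. Hadamard-based) rotations for O(d log d) cost; this is a toy.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(x, bits, unbiased, rng=None):
    # Uniform grid over the observed range -- a simplification; a
    # distribution-aware scheme would place levels for the near-Gaussian
    # post-rotation coordinates instead.
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    t = (x - lo) / step
    if unbiased:
        # Stochastic rounding: round up with probability equal to the
        # fractional part, so E[q] == x (unbiased, but higher variance).
        rng = rng or np.random.default_rng(1)
        idx = np.floor(t) + (rng.random(x.shape) < (t - np.floor(t)))
    else:
        # Round-to-nearest: minimizes per-coordinate squared error (biased).
        idx = np.round(t)
    return lo + idx * step

d, bits = 1024, 4
x = np.random.default_rng(2).standard_normal(d)
R = random_rotation(d)
r = R @ x                                   # rotate, quantize, rotate back
for unbiased in (False, True):
    q = R.T @ quantize(r, bits, unbiased)
    print(f"unbiased={unbiased}: MSE={np.mean((q - x) ** 2):.5f}")
```

Running this shows the point: round-to-nearest gives the lower MSE but a biased estimate, while stochastic rounding buys unbiasedness at the cost of extra error, which is exactly where the "full bit" gap can show up.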
linuxhansl•51m ago
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
everythingctl•36m ago
I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. They don't let you store a bigger model. In some sense that puts local LLM usage at a further disadvantage relative to inference done in a hyperscaler's data center.
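A rough back-of-the-envelope sketch of why that is. The shapes below are illustrative 7B-class numbers I picked for the example, not tied to any particular deployment:

```python
# KV cache per sequence = 2 (K and V) * layers * kv_heads * head_dim
#                         * seq_len * bytes per element
layers, kv_heads, head_dim = 32, 32, 128   # illustrative 7B-class shapes
seq_len, budget_gib = 4096, 24             # context length, spare GPU memory

for name, bytes_per_elem in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    per_seq = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    fits = int(budget_gib * 2**30 // per_seq)
    print(f"{name}: {per_seq / 2**30:.2f} GiB per sequence "
          f"-> {fits} concurrent sequences")
```

With these numbers the fp16 cache is 2 GiB per sequence (12 fit in the budget), int8 doubles that to 24, and 4-bit to 48. The weights are untouched in every case: shrinking the cache raises the batch size or context length you can serve, which matters far more at datacenter scale than for a single local user.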