Tq-KV – Rust implementation of TurboQuant that works on GGUF models
2•onurgokyildiz•1h ago
TurboQuant came out at ICLR on March 25. We tried every available implementation on GGUF models. None of them produced usable output. Perplexity goes from 5.18 to 3,556. The model starts mixing languages mid-sentence, hallucinating citations, losing coherence entirely.
It's compound quantization error. GGUF models already have quantized weights. Quantize the KV cache on top of that, and the errors multiply through softmax. Nobody was handling this.
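A toy numerical sketch of the compounding (hypothetical round-to-nearest scalar quantizer, pre-softmax logits -- not tq-kv's actual codebook): quantizing the weights alone or the keys alone perturbs an attention logit a little, but quantizing both stacks the errors, and softmax then exponentiates the perturbed logit.

```rust
/// Toy round-to-nearest scalar quantizer (illustrative, not tq-kv's codebook).
fn quant(x: &[f32], scale: f32) -> Vec<f32> {
    x.iter().map(|&v| (v / scale).round() * scale).collect()
}

/// Attention logit = query . key (pre-softmax).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Absolute logit error when quantizing weights only, keys only, or both.
fn logit_errors(q: &[f32], k: &[f32], scale: f32) -> (f32, f32, f32) {
    let exact = dot(q, k);
    let (qq, qk) = (quant(q, scale), quant(k, scale));
    (
        (exact - dot(&qq, k)).abs(),   // weights already quantized (GGUF)
        (exact - dot(q, &qk)).abs(),   // KV cache quantized on its own
        (exact - dot(&qq, &qk)).abs(), // both at once: errors compound
    )
}
```

With q = [0.3, -0.7], k = [0.6, 0.2] and a coarse scale of 0.25, the combined error is larger than either error alone -- the same stacking, across billions of logits, is what turns a 5.18 PPL model into a 3,556 one.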
So we wrote our own from scratch. 13.7K lines of Rust, 86 tests.
The first thing we tried was embarrassingly simple: keep sink tokens in FP16 (they anchor attention), skip quantizing the current token, and hard-reset the cache between conversations. We called it the 3-Fix Framework mostly as a joke because each fix is so obvious. But together they take PPL from 3,556 to 6.07. The cache reset was the one that surprised us -- turns out quantization errors were silently accumulating across conversations, and the model would drift after a few turns.
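The three fixes can be sketched as a precision policy plus a reset hook (hypothetical types; tq-kv's real API differs):

```rust
/// Which precision a cached token's K/V entry should use.
#[derive(Debug, PartialEq)]
enum Precision {
    Fp16,
    Quantized,
}

/// Fix 1 + Fix 2: sink tokens stay in FP16 (they anchor attention),
/// and the newest token is never quantized.
fn precision_for(pos: usize, seq_len: usize, n_sinks: usize) -> Precision {
    let is_sink = pos < n_sinks;
    let is_current = pos + 1 == seq_len;
    if is_sink || is_current {
        Precision::Fp16
    } else {
        Precision::Quantized
    }
}

/// Fix 3: hard-reset between conversations so quantization error
/// cannot silently accumulate across turns.
struct KvCache {
    tokens: Vec<Precision>,
}

impl KvCache {
    fn reset(&mut self) {
        self.tokens.clear();
    }
}
```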
The bigger win was compressing keys before rotary position embedding instead of after. Post-RoPE keys have position-dependent statistics that break the Gaussian assumption Lloyd-Max codebooks rely on. Pre-RoPE keys don't have this problem. PPL gap drops from +17% to +3.7%. We almost didn't try this because it meant restructuring the entire compression pipeline. Glad we did.
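The ordering change, in miniature (toy 8-bit scalar quantizer standing in for the Lloyd-Max codebook, single rotation pair standing in for full RoPE): compress the raw key, and only apply the position-dependent rotation at attention time.

```rust
/// Rotate one head-dim pair by the RoPE angle for this position.
/// Real RoPE rotates every (2i, 2i+1) pair at frequency base^(-2i/d);
/// a single pair is enough to show the ordering.
fn rope_pair(x0: f32, x1: f32, pos: usize, angle: f32) -> (f32, f32) {
    let theta = angle * pos as f32;
    let (s, c) = theta.sin_cos();
    (x0 * c - x1 * s, x0 * s + x1 * c)
}

/// Toy quantizer: raw (pre-RoPE) keys are near-Gaussian, so a fixed
/// codebook fits them; post-RoPE keys have position-dependent stats.
fn quantize(x: f32, scale: f32) -> i8 {
    (x / scale).round().clamp(-127.0, 127.0) as i8
}

fn dequantize(q: i8, scale: f32) -> f32 {
    q as f32 * scale
}

/// Pre-RoPE order: quantize first, rotate the reconstruction later.
fn key_pre_rope(k: (f32, f32), pos: usize, scale: f32, angle: f32) -> (f32, f32) {
    let d = (
        dequantize(quantize(k.0, scale), scale),
        dequantize(quantize(k.1, scale), scale),
    );
    rope_pair(d.0, d.1, pos, angle)
}
```

The cost is that the rotation now happens per attention call instead of once at ingest, which is part of why the whole compression pipeline had to be restructured.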
We also integrated KV Compaction (Zweiger 2026), which is orthogonal to bit compression -- TurboQuant reduces bits per token, Compaction reduces number of tokens. You select important keys using attention scores from all GQA-mapped query heads, fit biases to preserve the softmax partition function, then solve for synthetic values via ridge regression. Our first attempt used a single reference query and performed terribly. Switching to all query heads with mean-based scoring fixed it -- PPL went from 5.78 to 2.23 on the same config. Combined with TurboQuant: up to 25x effective compression.
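The selection step -- the part where single-reference-query scoring failed and mean scoring over all GQA-mapped heads fixed it -- can be sketched like this (illustrative shapes; the bias fitting and ridge-regression solve for synthetic values come after this step and aren't shown):

```rust
/// Score each key position by its mean attention weight across every
/// query head mapped to this KV head (GQA). Using a single reference
/// query instead is what performed terribly.
/// attn[head][key_pos] holds post-softmax attention weights.
fn key_importance(attn: &[Vec<f32>]) -> Vec<f32> {
    let n_heads = attn.len() as f32;
    let n_keys = attn[0].len();
    (0..n_keys)
        .map(|k| attn.iter().map(|head| head[k]).sum::<f32>() / n_heads)
        .collect()
}

/// Keep the top-k most important key positions, returned in order.
fn select_keys(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx.sort();
    idx
}
```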
On the systems side, we replaced dense QJL with a structured Hadamard variant that's 115x faster and somehow also better quality (+4.5 dB SNR -- we still don't fully understand why). Fused attention computes scores directly from compressed indices using AVX2+FMA centroid lookup (6-8.9x speedup). And we built an append-only O(1) incremental cache, because the naive approach recompresses the entire cache on every token, which at 4K context means 935x overhead.
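The append-only idea is the simplest of the three to sketch (toy scalar codes standing in for the real compressed representation): compress only the newcomer and push it, never touching what's already stored.

```rust
/// Append-only incremental cache: each new token's K/V is compressed
/// exactly once. The naive alternative recompresses all stored tokens
/// on every decode step, i.e. O(len) work per token instead of O(1).
struct IncrementalCache {
    compressed: Vec<Vec<i8>>, // one code vector per token
    scale: f32,               // toy quantizer step size
}

impl IncrementalCache {
    fn new(scale: f32) -> Self {
        Self { compressed: Vec::new(), scale }
    }

    /// O(1) per token (amortized): compress just the new K/V and append.
    fn append(&mut self, kv: &[f32]) {
        let codes = kv
            .iter()
            .map(|&x| (x / self.scale).round().clamp(-127.0, 127.0) as i8)
            .collect();
        self.compressed.push(codes);
    }

    fn len(&self) -> usize {
        self.compressed.len()
    }
}
```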
Numbers on Qwen 2.5 7B Q4_K_M at 4-bit: +3.7% PPL delta with Pre-RoPE compression (other implementations manage +17% at best even with our 3-Fix applied, or 3,556 without it). 9/9 on needle-in-a-haystack. 7.5-14.2x VRAM reduction. With compaction, effective compression reaches 25x. Tested across Llama-3, Qwen2.5, Gemma, Mistral, and Phi-3.
There's also tq-engine on top of this -- model hub, HTTP API, calibration pipeline. Basically a Rust Ollama with KV compression built in.
crates.io: https://crates.io/crates/tq-kv
GitHub: https://github.com/onur-gokyildiz-bhi/tq-kv