Tq-KV – Rust implementation of TurboQuant that works on GGUF models
2•onurgokyildiz•1h ago
TurboQuant came out at ICLR on March 25. We tried every available implementation on GGUF models. None of them produced usable output. Perplexity goes from 5.18 to 3,556. The model starts mixing languages mid-sentence, hallucinating citations, losing coherence entirely.
It's compound quantization error. GGUF models already have quantized weights. Quantize the KV cache on top of that, and the errors multiply through softmax. Nobody was handling this.
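A toy numerical sketch of the compounding (hypothetical round-to-nearest scalar quantizer, pre-softmax logits -- not tq-kv's actual codebook): quantizing the weights alone or the keys alone perturbs an attention logit a little, but quantizing both stacks the errors, and softmax then exponentiates the perturbed logit.

```rust
/// Toy round-to-nearest scalar quantizer (illustrative, not tq-kv's codebook).
fn quant(x: &[f32], scale: f32) -> Vec<f32> {
    x.iter().map(|&v| (v / scale).round() * scale).collect()
}

/// Attention logit = query . key (pre-softmax).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Absolute logit error when quantizing weights only, keys only, or both.
fn logit_errors(q: &[f32], k: &[f32], scale: f32) -> (f32, f32, f32) {
    let exact = dot(q, k);
    let (qq, qk) = (quant(q, scale), quant(k, scale));
    (
        (exact - dot(&qq, k)).abs(),   // weights already quantized (GGUF)
        (exact - dot(q, &qk)).abs(),   // KV cache quantized on its own
        (exact - dot(&qq, &qk)).abs(), // both at once: errors compound
    )
}
```

With q = [0.3, -0.7], k = [0.6, 0.2] and a coarse scale of 0.25, the combined error is larger than either error alone -- the same stacking, across billions of logits, is what turns a 5.18 PPL model into a 3,556 one.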
So we wrote our own from scratch. 13.7K lines of Rust, 86 tests.
The first thing we tried was embarrassingly simple: keep sink tokens in FP16 (they anchor attention), skip quantizing the current token, and hard-reset the cache between conversations. We called it the 3-Fix Framework mostly as a joke because each fix is so obvious. But together they take PPL from 3,556 to 6.07. The cache reset was the one that surprised us -- turns out quantization errors were silently accumulating across conversations, and the model would drift after a few turns.
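The three fixes can be sketched as a precision policy plus a reset hook (hypothetical types; tq-kv's real API differs):

```rust
/// Which precision a cached token's K/V entry should use.
#[derive(Debug, PartialEq)]
enum Precision {
    Fp16,
    Quantized,
}

/// Fix 1 + Fix 2: sink tokens stay in FP16 (they anchor attention),
/// and the newest token is never quantized.
fn precision_for(pos: usize, seq_len: usize, n_sinks: usize) -> Precision {
    let is_sink = pos < n_sinks;
    let is_current = pos + 1 == seq_len;
    if is_sink || is_current {
        Precision::Fp16
    } else {
        Precision::Quantized
    }
}

/// Fix 3: hard-reset between conversations so quantization error
/// cannot silently accumulate across turns.
struct KvCache {
    tokens: Vec<Precision>,
}

impl KvCache {
    fn reset(&mut self) {
        self.tokens.clear();
    }
}
```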
The bigger win was compressing keys before rotary position embedding instead of after. Post-RoPE keys have position-dependent statistics that break the Gaussian assumption Lloyd-Max codebooks rely on. Pre-RoPE keys don't have this problem. PPL gap drops from +17% to +3.7%. We almost didn't try this because it meant restructuring the entire compression pipeline. Glad we did.
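The ordering change, in miniature (toy 8-bit scalar quantizer standing in for the Lloyd-Max codebook, single rotation pair standing in for full RoPE): compress the raw key, and only apply the position-dependent rotation at attention time.

```rust
/// Rotate one head-dim pair by the RoPE angle for this position.
/// Real RoPE rotates every (2i, 2i+1) pair at frequency base^(-2i/d);
/// a single pair is enough to show the ordering.
fn rope_pair(x0: f32, x1: f32, pos: usize, angle: f32) -> (f32, f32) {
    let theta = angle * pos as f32;
    let (s, c) = theta.sin_cos();
    (x0 * c - x1 * s, x0 * s + x1 * c)
}

/// Toy quantizer: raw (pre-RoPE) keys are near-Gaussian, so a fixed
/// codebook fits them; post-RoPE keys have position-dependent stats.
fn quantize(x: f32, scale: f32) -> i8 {
    (x / scale).round().clamp(-127.0, 127.0) as i8
}

fn dequantize(q: i8, scale: f32) -> f32 {
    q as f32 * scale
}

/// Pre-RoPE order: quantize first, rotate the reconstruction later.
fn key_pre_rope(k: (f32, f32), pos: usize, scale: f32, angle: f32) -> (f32, f32) {
    let d = (
        dequantize(quantize(k.0, scale), scale),
        dequantize(quantize(k.1, scale), scale),
    );
    rope_pair(d.0, d.1, pos, angle)
}
```

The cost is that the rotation now happens per attention call instead of once at ingest, which is part of why the whole compression pipeline had to be restructured.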
We also integrated KV Compaction (Zweiger 2026), which is orthogonal to bit compression -- TurboQuant reduces bits per token, Compaction reduces number of tokens. You select important keys using attention scores from all GQA-mapped query heads, fit biases to preserve the softmax partition function, then solve for synthetic values via ridge regression. Our first attempt used a single reference query and performed terribly. Switching to all query heads with mean-based scoring fixed it -- PPL went from 5.78 to 2.23 on the same config. Combined with TurboQuant: up to 25x effective compression.
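The selection step -- the part where single-reference-query scoring failed and mean scoring over all GQA-mapped heads fixed it -- can be sketched like this (illustrative shapes; the bias fitting and ridge-regression solve for synthetic values come after this step and aren't shown):

```rust
/// Score each key position by its mean attention weight across every
/// query head mapped to this KV head (GQA). Using a single reference
/// query instead is what performed terribly.
/// attn[head][key_pos] holds post-softmax attention weights.
fn key_importance(attn: &[Vec<f32>]) -> Vec<f32> {
    let n_heads = attn.len() as f32;
    let n_keys = attn[0].len();
    (0..n_keys)
        .map(|k| attn.iter().map(|head| head[k]).sum::<f32>() / n_heads)
        .collect()
}

/// Keep the top-k most important key positions, returned in order.
fn select_keys(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx.sort();
    idx
}
```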
On the systems side, we replaced dense QJL with a structured Hadamard variant that's 115x faster and somehow also better quality (+4.5 dB SNR -- we still don't fully understand why). Fused attention computes scores directly from compressed indices using AVX2+FMA centroid lookup (6-8.9x speedup). And we built an append-only O(1) incremental cache, because the naive approach recompresses the entire cache on every token, which at 4K context means 935x overhead.
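The append-only idea is the simplest of the three to sketch (toy scalar codes standing in for the real compressed representation): compress only the newcomer and push it, never touching what's already stored.

```rust
/// Append-only incremental cache: each new token's K/V is compressed
/// exactly once. The naive alternative recompresses all stored tokens
/// on every decode step, i.e. O(len) work per token instead of O(1).
struct IncrementalCache {
    compressed: Vec<Vec<i8>>, // one code vector per token
    scale: f32,               // toy quantizer step size
}

impl IncrementalCache {
    fn new(scale: f32) -> Self {
        Self { compressed: Vec::new(), scale }
    }

    /// O(1) per token (amortized): compress just the new K/V and append.
    fn append(&mut self, kv: &[f32]) {
        let codes = kv
            .iter()
            .map(|&x| (x / self.scale).round().clamp(-127.0, 127.0) as i8)
            .collect();
        self.compressed.push(codes);
    }

    fn len(&self) -> usize {
        self.compressed.len()
    }
}
```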
Numbers on Qwen 2.5 7B Q4_K_M at 4-bit: +3.7% PPL delta with Pre-RoPE compression (other implementations manage +17% at best even with our 3-Fix applied, or 3,556 without it). 9/9 on needle-in-a-haystack. 7.5-14.2x VRAM reduction. With compaction, effective compression reaches 25x. Tested across Llama-3, Qwen2.5, Gemma, Mistral, and Phi-3.
There's also tq-engine on top of this -- model hub, HTTP API, calibration pipeline. Basically a Rust Ollama with KV compression built in.
crates.io: https://crates.io/crates/tq-kv
GitHub: https://github.com/onur-gokyildiz-bhi/tq-kv