
Show HN: Dendrite – O(1) KV cache forking for tree-structured LLM inference

https://github.com/BioInfo/dendrite
3•RyeCatcher•1h ago

Comments

RyeCatcher•1h ago
Hey HN, author here. Happy to answer questions.

Why Rust? - The ownership model makes the CoW block table provably safe: the borrow checker enforces that you can't alias a block that's being written. That's not just style; it eliminates a whole class of correctness bugs that plague Python/C++ inference engines.
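A minimal sketch of that aliasing guarantee, assuming one `Arc` per block (`Block` and `exclusive` are illustrative names, not Dendrite's actual types):

```rust
use std::sync::Arc;

// Hypothetical KV block for illustration; not Dendrite's real API.
pub struct Block {
    pub data: Vec<f32>,
}

// Arc::get_mut yields a &mut only when the strong count is exactly 1,
// so a block still shared with another branch can never be written in
// place -- the type system simply won't hand out the mutable reference.
pub fn exclusive(block: &mut Arc<Block>) -> Option<&mut Block> {
    Arc::get_mut(block)
}
```

The point is that "don't write to a shared block" stops being a convention enforced by code review and becomes a property the compiler and `Arc` enforce for you.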

How is this different from vLLM's paged attention? - vLLM pages memory in fixed blocks to avoid fragmentation. Dendrite does that too, but adds O(1) KV cache forking via copy-on-write. When you branch a beam or MCTS node, you get a shallow pointer copy (~500ns) instead of copying the full KV cache. The deeper the tree, the bigger the win.
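The fork-then-copy-on-write flow can be sketched in a few lines, again assuming `Arc`-per-block refcounting (`BlockTable`, `fork`, and `write` are illustrative names, not Dendrite's actual API):

```rust
use std::sync::Arc;

// Sketch of a CoW block table. Each block is refcounted; forking a
// branch clones only the pointer vector, never the KV data itself.
#[derive(Clone)]
pub struct BlockTable {
    blocks: Vec<Arc<Vec<f32>>>,
}

impl BlockTable {
    pub fn new(n_blocks: usize, block_len: usize) -> Self {
        BlockTable {
            blocks: (0..n_blocks).map(|_| Arc::new(vec![0.0; block_len])).collect(),
        }
    }

    // O(1) in the KV data: bumps refcounts via a shallow pointer copy.
    pub fn fork(&self) -> Self {
        self.clone()
    }

    // Copy-on-write: make_mut deep-copies a block only if a sibling
    // branch still shares it, then mutates in place from then on.
    pub fn write(&mut self, block: usize, off: usize, val: f32) {
        Arc::make_mut(&mut self.blocks[block])[off] = val;
    }

    pub fn read(&self, block: usize, off: usize) -> f32 {
        self.blocks[block][off]
    }
}
```

So a child branch only pays a copy for the blocks it actually writes (typically the last partial block); everything shared with the parent stays shared, which is why deep trees amortize so well.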

TurboQuant: Google published this last week (ICLR 2026). We already have a Rust implementation: a PolarQuant + QJL pipeline in `cache/compress.rs`. Measured 3x compression at head_dim=128 on CPU; the paper claims 6x with per-head grouping (support coming).
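For intuition only, here's a generic per-block absmax int8 quantizer. This is NOT the PolarQuant/QJL math in `cache/compress.rs`; it just shows the basic shape of KV-cache compression (i8 codes plus one f32 scale per block):

```rust
// Illustrative absmax quantization: scale the block so its largest
// magnitude maps to 127, store i8 codes plus the scale.
pub fn quantize(block: &[f32]) -> (Vec<i8>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let codes = block.iter().map(|x| (x / scale).round() as i8).collect();
    (codes, scale)
}

pub fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}
```

The real pipeline gets better ratios than this naive scheme by exploiting the rotational structure of keys (PolarQuant) and sketching (QJL), but the storage trade is the same idea: small integer codes plus a little per-block metadata.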

Status: This is research-grade, not production. No Python bindings yet (tracked in issues), no FlashAttention kernels. Best fit today: tree-structured search (MCTS, beam search, speculative decoding) where you want to control the whole stack.