CacheShrink is a KV cache compression library I built using Riemannian optimization on Stiefel manifolds to decompose key-value matrices into latent representations and reconstruct them efficiently. It works with HuggingFace and supports multiple attention styles including MHA (Multi-Head Attention) and GQA (Grouped Query Attention), using XKV-style compression for GQA. The approach achieves significant memory reduction with minimal loss in perplexity.
bell-cot•1h ago
Not an HN staffer...but maybe read this and update your title?
Kranium2002•1h ago