Reuse non-prefix KV Cache and speed up RAG by 3X with LMCache

https://github.com/LMCache/LMCache-Examples/blob/main/demo-rag-blending/README.md

5•lihanc111•7h ago

Comments

lihanc111•7h ago

Hey HN Community!

A while back, we shared our open-source project LMCache here and were blown away by the incredible support and feedback. Today, our team is thrilled to share more about one of our core components: CacheBlend. Recognized with a Best Paper Award at ACM EuroSys 2025, this technique is a painkiller for efficient RAG applications.

The Problem: Your KV Cache is Wasting Potential

In modern LLM applications like RAG and Agents, we constantly feed the model new context. For example, in RAG, we retrieve relevant documents and stuff them into the prompt.

The issue is that this dynamically retrieved context doesn't always appear at the beginning of the input sequence. Traditional KV caching only reuses a "common prefix," so if the new information isn't at the very start, the cache hit rate plummets, and your GPU ends up recomputing the same things over and over.

The Solution: CacheBlend - 100% Hit Rate, No Compromises

CacheBlend changes the game by allowing for the reuse of pre-computed KV caches regardless of their position in the input sequence.

This means we can finally achieve a 100% KV Cache hit rate in applications like RAG. The performance gains are significant:

Faster Time-To-First-Token (TTFT): Get your initial response much quicker.

More Throughput: Serve significantly more users with the same hardware.

Almost lossless Output Quality: All of this is achieved with little degradation in the model's generation quality.
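To make the hit-rate difference concrete, here is a toy sketch in Python. It is not LMCache code: the chunk names, token counts, and both helper functions are made up for illustration. It only counts how many prompt tokens have a reusable KV cache under prefix-only matching versus position-independent matching.

```python
# Toy model: a request is a list of (chunk_id, token_count) pairs.
# All names and numbers here are hypothetical, for illustration only.

def prefix_reuse(cached_request, new_request):
    """Tokens reusable when only the longest common prefix of chunks counts."""
    reused = 0
    for old, new in zip(cached_request, new_request):
        if old != new:
            break  # the first mismatch ends the shared prefix
        reused += new[1]
    return reused

def any_position_reuse(cached_chunk_ids, new_request):
    """Tokens reusable when a cached chunk can be reused at any position."""
    return sum(n for chunk_id, n in new_request if chunk_id in cached_chunk_ids)

earlier = [("system", 50), ("doc_a", 400), ("doc_b", 300), ("question_1", 20)]
# Same documents, retrieved in a different order, followed by a new question.
new = [("system", 50), ("doc_b", 300), ("doc_a", 400), ("question_2", 20)]

total = sum(n for _, n in new)
prefix_hits = prefix_reuse(earlier, new)
blend_hits = any_position_reuse({chunk_id for chunk_id, _ in earlier}, new)

print(f"prefix-only reuse:  {prefix_hits}/{total} tokens")
print(f"any-position reuse: {blend_hits}/{total} tokens")
```

In this toy case every retrieved document is found in the cache under position-independent matching, and only the new question still needs a full prefill. The count ignores the small amount of selective recomputation that blending requires.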

How does it work?

CacheBlend intelligently handles the two main challenges of reusing non-prefix caches:

Positional Encoding Update: It efficiently updates positional encodings to ensure the model always knows the correct position of each token, even when we're stitching together cached and new data.

Selective Attention Recalculation: Instead of recomputing everything, it strategically recalculates only the minimal cross-attention needed between the new and cached chunks, keeping generation quality close to that of full recomputation.
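The two steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the CacheBlend implementation: it assumes the model uses rotary position embeddings (RoPE), where a cached key can be moved to a new position by applying one more rotation, and it assumes a per-token "deviation" score is already available to decide which tokens to recompute. The function names, the score, and the `ratio` parameter are all hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate an even-length vector to position `pos` (standard RoPE pairing)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def shift_cached_key(cached_key, delta, base=10000.0):
    """Move an already-rotated key by `delta` positions.

    Rotations compose, so R(p + delta) k = R(delta) (R(p) k). The cached key
    can be repositioned without access to the original, unrotated key.
    """
    return rope(cached_key, delta, base)

def select_tokens_to_recompute(deviations, ratio):
    """Pick the fraction `ratio` of tokens with the largest deviation scores.

    Assumption: `deviations` holds one hypothetical score per token that
    estimates how far its cached KV is from a full recomputation.
    """
    k = max(1, math.ceil(ratio * len(deviations)))
    ranked = sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)
    return sorted(ranked[:k])

key = [1.0, 2.0, 3.0, 4.0]
cached = rope(key, 5)                # key cached when the chunk started at position 5
moved = shift_cached_key(cached, 7)  # the chunk now starts 7 positions later
fresh = rope(key, 12)                # what a full recompute would produce

max_error = max(abs(a - b) for a, b in zip(moved, fresh))
picked = select_tokens_to_recompute([0.1, 0.9, 0.05, 0.4, 0.7, 0.2], ratio=0.3)
print(max_error, picked)
```

The position shift is exact for keys under this assumption; the quality trade-off comes from the second step, where only a subset of tokens has its attention with the other chunks recomputed.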

For detailed analysis, please refer to the official paper: https://dl.acm.org/doi/10.1145/3689031.3696098

Where can I try it?

Our official repo is at: https://github.com/LMCache/LMCache

The newest interactive CacheBlend demo is at: https://github.com/LMCache/LMCache-Examples/tree/main/demo-r...

Ask us anything!
