Lossless LLM 3x Throughput Increase by LMCache

103•lihanc111•4d ago

Comments

lihanc111•4d ago

Our team has built this open source project, LMCache, to reduce repetitive computation in LLM inference and let systems serve more people (3x more throughput in chat applications). It has been used in IBM's open source LLM inference stack.

In LLM serving, the input is computed into intermediate states called the KV cache, which is then used to generate answers. This data is relatively large (~1-2GB for long context) and is often evicted when GPU memory runs short. In these cases, when a user asks a follow-up question, the software needs to recompute the same KV cache. LMCache is designed to combat that by efficiently offloading and loading KV caches to and from DRAM and disk.
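
For a rough sense of where a figure like ~1-2GB can come from, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 values) is an assumed Llama-style configuration picked for illustration; it is not stated in the post, and real models differ.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size: one K and one V vector per token, layer, and KV head.

    All defaults are assumptions for illustration, not LMCache or vLLM settings.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


per_token_kib = kv_cache_bytes(1) / 1024
size_16k_gib = kv_cache_bytes(16 * 1024) / 1024**3
print(f"{per_token_kib:.0f} KiB per token, "
      f"{size_16k_gib:.1f} GiB for a 16K-token context")
# prints: 128 KiB per token, 2.0 GiB for a 16K-token context
```

Under these assumptions a single 16K-token conversation already occupies about 2 GiB, which is why a GPU serving many users may have to evict it.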

Ask us anything!

dist-epoch•4h ago

How is it possible to do non-prefix KV cache? I was under the impression that the V for one token potentially depends on the V of all previous ones.

da-x•4h ago

Yes, there's KV cache 'blending'; see [1].

Future versions of LMCache are aiming to support this.

[1] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion - https://arxiv.org/abs/2405.16444

pama•2h ago

Is your aim targeting inference at scale or specialized/new/simpler inference pipelines? SGLang and vLLM have disaggregated prefill and decoding serving (e.g. https://docs.vllm.ai/examples/online_serving/disaggregated_s... or https://github.com/sgl-project/sglang/issues/3554 and https://github.com/sgl-project/sglang/issues/4655). Could your solution enable a model-agnostic cache store/server, or is that orthogonal to what you are trying to achieve?

nativeit•1h ago

Has it been used in IBM's inference stack, or used with IBM's inference stack? In other words, has this been merged into IBM's own repositories, or has someone just tested it using them?

behnamoh•1h ago

> Our team

So this is something that might in the future turn into a commercial product? Something like LangChain and thousands of open source projects that started as "open source" but then ended up implementing proprietary features for a cost.

0xjunhao•4d ago

Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, the inference can be an arbitrary number of times faster, as it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution.
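
The two cases above can be sketched with a toy latency model in which a cache hit replaces prefill with loading the stored KV cache. Every number here (prefill and decode speeds, KV bytes per token, load bandwidth) is an assumption chosen for illustration, not a measurement of LMCache, and the model ignores the cost of storing the cache.

```python
def request_latency(input_tokens, output_tokens, cache_hit,
                    prefill_tok_per_s=5_000.0,    # assumed prefill speed
                    decode_tok_per_s=50.0,        # assumed decode speed
                    kv_bytes_per_token=131_072,   # assumed KV size per token
                    load_bytes_per_s=10e9):       # assumed DRAM/disk-to-GPU bandwidth
    """Toy end-to-end latency in seconds for one request."""
    if cache_hit:
        # Hit: skip prefill, pay for moving the KV cache back to the GPU.
        first_stage = input_tokens * kv_bytes_per_token / load_bytes_per_s
    else:
        # Miss: recompute the KV cache from the input tokens.
        first_stage = input_tokens / prefill_tok_per_s
    return first_stage + output_tokens / decode_tok_per_s


def speedup(input_tokens, output_tokens):
    """Ratio of miss latency to hit latency under the toy model."""
    miss = request_latency(input_tokens, output_tokens, cache_hit=False)
    hit = request_latency(input_tokens, output_tokens, cache_hit=True)
    return miss / hit


long_in_short_out = speedup(32_000, 10)
short_in_long_out = speedup(100, 2_000)
print(f"long input, short output: {long_in_short_out:.1f}x")
print(f"short input, long output: {short_in_long_out:.3f}x")
```

In this model the gain is large for long inputs with short outputs, but it is capped by the ratio of per-token prefill time to per-token load time, and it shrinks toward 1x as decode time dominates.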

iLoveOncall•4h ago

Is this any different than prompt caching?

smcleod•4h ago

Have you considered integrating it with the likes of llama.cpp?

m3kw9•3h ago

How would it work if a user wants to do 1 of n tries?

kcorbitt•2h ago

Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache the fact that you can offload to CPU and disk as well? How much is throughput/latency affected if you need to pull a large KV cache from disk/cpu instead of GPU RAM?

Also, how realistic would it be to share the KV cache across vllm nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?

guywhocodes•6m ago

This is exactly what llm-d is

ekianjo•2h ago

wasn't this already implemented in llama.cpp?

sgammon•2h ago

Hey LMCache team! Saw you guys at OSS N.A. but wasn’t able to set aside time to say hello. We’d love to chat about collaborating. Is there an email we can reach out to?

refulgentis•2h ago

Word to the wise:

"Lossless 3x Throughput Increase" == "Cache all inputs and outputs across everyone, in RAM and on disk, and if you assume the next request is covered by cache, it's 3x faster!"

I'm more surprised it's only advertised as 3x under those conditions: my llama.cpp wrapper does the same -- caching in RAM while running locally seems fine to me -- and when input is cached, TTFT is ~instantaneous, modulo any add'l prompt you add.

I suppose it creates a little more distance, in that, instead of infinity times faster for latency, we measure throughput, and then our speedup can be adjusted as desired by adjusting output length, and thus we can pick a more reasonable-sounding metric like 3x. (though, the GitHub README still frames it in terms of latency / TTFT)

varispeed•2h ago

Sometimes I think the entire engineering profession collectively underwent a lobotomy. Techniques like caching partial computation results to avoid repeating expensive work were so basic a few decades ago that no one would have bothered to dignify them with a paper, let alone brand them with a fancy acronym and announce them like the second coming of Turing. Now we get breathless blog posts and community calls over the mind-blowing discovery that storing KV caches of repeated text speeds things up. Next we'll get a paper on using hash tables to look things up faster. Meanwhile, actual difficult problems in large-scale distributed inference and model interpretability get hand-waved so we can posture about reinventing memoisation. Tech never fails to take the obvious, put a bow on it, and sell it back to us as groundbreaking.

vlovich123•1h ago

Partial caching as a concept doesn’t matter. The hard part is figuring out how to make it work for cross attention which sets up a data dependency for every entry on every preceding entry. So prefix caching of KV cache is brain dead easy. Computing a KV cache for random bits of text and then combining unrelated text in a way that makes the LLM still work coherently and correctly? That to me seems much harder.

It seems to me like you’re easily hand waving away a hard problem in a different part of the stack you’re less familiar with.

nativeit•2h ago

It seems odd to me that so many of these projects are being launched by people who have only just discovered and/or joined HN. I'm worried this is just becoming LinkedIn for AI opportunists.

parpfish•1h ago

I’ve got a side project that I may (someday) do a show HN with. However, I’d probably make a new account for that because the project is connected to my real name/portfolio and I don’t want that connected with my pseudonymous comments here

nativeit•1h ago

I considered that, but then why would anyone obfuscate this really very reasonable scenario by choosing another ostensibly pseudonymous username?

fsmv•1h ago

[deleted]

parpfish•1h ago

I imagine that this is a common problem and it could be another cool “unlockable” on HN, like the downvotes at 500 karma.

Once you get X karma or account age >Y years, you can make one anonymous submission each quarter that comes from a non-user but still gets some sort of “verified” badge that proves it comes from a legit user.

refulgentis•1h ago

You nailed it IMHO.

I quit my job at Google 2 years ago to do LLM stuff, was looking forward to having HN around, but discussions re: LLMs here are a minefield.

Why?

Everyone knows at least a little, and everyone has a strong opinion on it given the impact of it. People sharing stuff sell it way high, and as with any new thing where people are selling, there are a lot of skeptics. Then, throw in human bias towards disliking what seems like snark / complaining, so stuff with substance gets downvotes.

SNR is continually decreasing.

Let's dig into why this one is weird:

My work runs inference using either 3P providers, which do caching, or llama.cpp, in which I do the caching. (basically, picture it as there's a super expensive step that you can skip by keeping Map<input string, gpu state>)
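
The Map<input string, gpu state> picture can be sketched as a longest-prefix lookup over hashed token chunks. This is a minimal illustration, not the actual code of llama.cpp, vLLM, or LMCache: `PrefixKVStore`, `chunk_keys`, and the chunk size are hypothetical names and values, and the "state" is a placeholder string where a real system would hold GPU tensors.

```python
import hashlib

CHUNK = 4  # tokens per cached chunk; tiny on purpose, real systems use larger chunks


def chunk_keys(tokens):
    """Yield (prefix_length, key) for each full chunk.

    Each key hashes the whole prefix up to that chunk, so two inputs share a
    key only if they agree on every token before that point.
    """
    h = hashlib.sha256()
    for end in range(CHUNK, len(tokens) + 1, CHUNK):
        h.update(repr(tokens[end - CHUNK:end]).encode())
        yield end, h.copy().hexdigest()


class PrefixKVStore:
    """Hypothetical in-memory map from prefix hash to an opaque KV state."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, make_state):
        # Record a state for every chunk-aligned prefix of the input.
        for end, key in chunk_keys(tokens):
            self._store.setdefault(key, make_state(tokens[:end]))

    def longest_hit(self, tokens):
        """Return (num_cached_tokens, state) for the longest cached prefix."""
        best = (0, None)
        for end, key in chunk_keys(tokens):
            if key not in self._store:
                break
            best = (end, self._store[key])
        return best


store = PrefixKVStore()
store.put(list(range(10)), lambda prefix: f"kv[{len(prefix)}]")

# Same first 8 tokens, different continuation: only the shared prefix is reused.
hit_len, state = store.longest_hit([0, 1, 2, 3, 4, 5, 6, 7, 99, 100, 101, 102])
print(hit_len, state)  # prints: 8 kv[8]
```

Only the tokens after `hit_len` would need the expensive prefill step; everything before it is served from the map.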

So I log into HN and see this and say to myself: 3x! throughput increase? This is either really clever or salesmanship, no way an optimization like that has been sitting around on the ground.

So I read the GitHub, see it's just "write everyone's inputs and outputs to disk, you can then use them to cobble together what the GPU state would be for an incoming request!", and write a mostly-polite comment below flagging "hey, this means writing everything to disk"

Then I start replying to you...but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top, and if you look like you're being mean, you'll eat downvotes, especially on a weekend.

And to your average reader, maybe I just don't understand vLLM, and am taking it out on good hackers just pushing code.

Then, when I go back, I immediately see a comment from someone who does use vLLM noting it already does caching.

Sigh.

nativeit•54m ago

Thanks for sharing. You certainly aren't alone in your sentiments. I am seeing similar trends in arXiv submissions, as it seems it has become something of a means to inflate the value of one's own product(s) with a veneer of academic rigor. There seems to be a S.O.P. emerging for AI tools that follows many of the same trends as the less-than-reputable blockchain/crypto projects.

pama•17m ago

I had related questions and checked out the project a bit deeper, though I haven't tested it seriously yet. The project did start work over a year ago based on relevant papers, before vLLM or SGLang had decent solutions; it might still be adding performance in some workflows, though I haven't tested it and some of the published measurements in the project are now stale. Caching the LLM KV cache to disk or external memory servers can be very helpful at scale. Cache management and figuring out cache invalidation is hard anyway, and I am not sure at what level a tight integration with inference servers or specialized inference pipelines can help vs. a loose coupling that could advance each component separately. It would be nice if there were decent protocols used by all inference engines to help this decoupling.

nativeit•1h ago

I'll just be unambiguous about this:

> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

https://news.ycombinator.com/newsguidelines.html

wg0•1h ago

Seems like snake oil to me. I mean, it lacks a clear explanation of how exactly it works, if at all.

ahmedhawas123•1h ago

Like this a lot and thanks for making it open source. Does this support ollama today? I only saw vLLM

jbentley1•49m ago

Is this the same as the prompt caching that other APIs (Anthropic, OpenAI, etc.) have had, just open source and for vLLM?
