KV Cache Is Becoming the Memory Hierarchy of Inference

https://touchdown-labs.com/blog/kv-cache-memory-hierarchy-inference.html

28•matt_d•2d ago

Comments

htk•29m ago

Hard to read article. The writing is curiously more robotic and repetitive than those written by AI.

chuzz•24m ago

i like how even if i can parse most of it it does sound like technically accurate technobabble, could be of inspiration for a tv show :D

tptacek•11m ago

There's like an interesting systems article here, but at this point I'd rather they just gave me the prompt they used to generate it, so I can read it interactively in my own GPT5.5 session.

cyanydeez•7m ago

ok, so for anyone who's not played with local models and watched what's going on with the KV cache:

1. You send your prompt, and nowadays, whatever harness you're using sends a whole mess of context: available skills, tools, guardrails, etc. The inference engine tokenizes it (text -> tokens) and then runs all of those tokens through the model in one pass, computing the attention keys and values for each one. This is the "Prompt Processing" (prefill) speed and it's the fastest portion of inference per token, since the whole prompt is processed in parallel. Those keys and values are what get cached.

2. The inference engine then generates the next tokens, more slowly, one at a time; the keys and values for these get appended to the cache as well (tokens -> text)
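
A toy sketch of those two phases, in pure Python with a single attention head. Everything here (`project`, `attend`, `KVCache`, the tiny 2-d vectors) is made up for illustration and is not any engine's actual API. The only point is that prefill fills the cache for the whole prompt, and each generation step appends one entry instead of recomputing everything:

```python
import math

def project(x, scale):
    # Stand-in for a learned K or V projection: it just scales the vector.
    return [scale * v for v in x]

def attend(query, keys, values):
    # Softmax-weighted average of values, weighted by query-key dot products.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))

def prefill(cache, prompt_vectors):
    # Step 1: process the whole prompt, storing K/V for every token.
    for x in prompt_vectors:
        cache.append(x)

def decode_step(cache, x):
    # Step 2: one new token; only its K/V are computed, the rest are reused.
    cache.append(x)
    return attend(x, cache.keys, cache.values)

def full_recompute(vectors):
    # What you pay with no cache: rebuild K/V for every token on each step.
    keys = [project(x, 0.5) for x in vectors]
    values = [project(x, 2.0) for x in vectors]
    return attend(vectors[-1], keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = KVCache()
prefill(cache, prompt)
cached_out = decode_step(cache, new_token)
uncached_out = full_recompute(prompt + [new_token])
```

Both paths give the same attention output; the cached one just skips redoing the projections for the three prompt tokens.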

Crucially: the KV cache lives in GPU memory (VRAM). It's a data structure managed by the inference engine rather than a hardware cache, and it's kept in VRAM because anything slower would drag down generation: it holds the keys and values for _all_ the tokens in a conversation. So like all caches, cache eviction has to occur to free up the VRAM necessary.
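
The eviction part can be sketched as a least-recently-used pool with a fixed token budget. This is a simplification under my own assumptions: `KVCachePool`, the conversation ids, and the 1000-token budget are all hypothetical, and as far as I know real engines track fixed-size blocks rather than whole conversations.

```python
from collections import OrderedDict

class KVCachePool:
    """Tracks per-conversation cache sizes and evicts the least recently used."""

    def __init__(self, budget_tokens):
        self.budget_tokens = budget_tokens
        self.entries = OrderedDict()  # conversation id -> cached token count

    def used(self):
        return sum(self.entries.values())

    def touch(self, conv_id, token_count):
        # Insert or update, marking this conversation as most recently used.
        self.entries[conv_id] = token_count
        self.entries.move_to_end(conv_id)
        evicted = []
        # Evict the oldest conversations until we fit, never the one just touched.
        while self.used() > self.budget_tokens and len(self.entries) > 1:
            old_id, _ = self.entries.popitem(last=False)
            evicted.append(old_id)
        return evicted

pool = KVCachePool(budget_tokens=1000)
pool.touch("a", 400)
pool.touch("b", 400)
evicted = pool.touch("c", 400)  # over budget, so "a" (the oldest) is dropped
```

If conversation "a" comes back after this, its whole context has to be processed again from scratch.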

So if you had a conversation an hour ago, in the cloud, it's doubtful any of that cache still exists, so if you got up to 500k tokens, you're going through step #1 again; if you're going turn by turn immediately, you can skip to #2.
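
To put rough numbers on that difference, here's a sketch that counts how many tokens have to go back through step #1 depending on whether the earlier cache is still around. The helper names and the 500-token history are hypothetical; the assumption is that only a matching leading prefix can be reused.

```python
def shared_prefix_len(cached_tokens, request_tokens):
    # Count how many leading tokens match; only that prefix is reusable.
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(cached_tokens, request_tokens):
    # Everything after the shared prefix has to go through step #1 again.
    return len(request_tokens) - shared_prefix_len(cached_tokens, request_tokens)

history = list(range(500))        # stand-in token ids for the earlier turns
next_turn = history + [900, 901]  # same history plus a short new message

warm = tokens_to_prefill(history, next_turn)  # cache still resident
cold = tokens_to_prefill([], next_turn)       # cache already evicted
```

With the cache warm only the 2 new tokens need processing; with it gone, all 502 do.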

So some of the reports in March about all the token gen allowance suddenly disappearing within hours were likely a KV cache/billing issue: they were charging you as if you were generating all those tokens for every back and forth. Whether it was a bug in billing vs a bug in programming, who knows.

The trouble is that the traditional webserver type of proxy caching & load balancing tricks that helped scale the web don't work here! Your conversation with 100k context has to return to the same cluster, maybe even the same GPU, to rely on the extraordinarily fast KV cache reuse.
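
One way to picture the routing constraint is sticky routing keyed on the conversation. A minimal sketch, assuming a hypothetical `pick_worker` helper and made-up worker names; it is not how any particular provider does it:

```python
import hashlib

def pick_worker(conversation_id, workers):
    # Hash the conversation id so every turn lands on the same worker.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

workers = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]  # hypothetical worker names
first_turn = pick_worker("conv-123", workers)
second_turn = pick_worker("conv-123", workers)
```

A round-robin balancer would scatter the turns across workers and miss the cache each time. Plain modulo hashing like this also reshuffles most conversations whenever the worker count changes, which is the usual argument for consistent or rendezvous hashing instead.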
