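1. You send your prompt, and nowadays whatever harness you're using sends a whole mess of context along with it: available skills, tools, guardrails, etc. The GPU/inference engine tokenizes all of it (text -> tokens) and runs those tokens through the model to build the attention keys and values for each one. This is the "prompt processing" (prefill) speed, and it's the fastest portion of inference per token because the whole prompt is processed in parallel, but it's essentially "buffering" before any output shows up. These per-token results can be cached.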
2. The inference engine then generates the next tokens one at a time, which is much slower per token (tokens -> text). Their keys and values are appended to the same cache as they're produced.
Crucially: the KV cache lives in GPU memory (VRAM), managed by the inference engine. It's not a typical software caching tier sitting on disk or across the network, and moving it somewhere slower would cost a lot of speed because it holds entries for _all_ the tokens in a conversation. So like any cache, eviction has to occur to free up the VRAM necessary.
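To make the eviction point concrete, here's a minimal sketch of least-recently-used eviction under a fixed memory budget. Everything in it is an assumption for illustration: `ToyKVCache` is a hypothetical name, the "entries" are just token counts standing in for real key/value tensors, and I'm not claiming any provider actually uses plain LRU.

```python
from collections import OrderedDict


class ToyKVCache:
    """Toy LRU cache: conversation id -> number of cached tokens (stand-in for KV tensors)."""

    def __init__(self, capacity_tokens):
        self.capacity_tokens = capacity_tokens  # pretend VRAM budget, measured in tokens
        self.entries = OrderedDict()            # oldest entry first, newest last

    def used_tokens(self):
        return sum(self.entries.values())

    def lookup(self, conversation_id):
        """Return the cached token count (0 on a miss) and mark the entry as recently used."""
        if conversation_id not in self.entries:
            return 0
        self.entries.move_to_end(conversation_id)
        return self.entries[conversation_id]

    def store(self, conversation_id, num_tokens):
        """Cache a conversation, evicting least recently used ones until it fits."""
        if num_tokens > self.capacity_tokens:
            raise ValueError("conversation is larger than the whole cache")
        self.entries.pop(conversation_id, None)  # replace any older copy of this conversation
        evicted = []
        while self.used_tokens() + num_tokens > self.capacity_tokens:
            old_id, _ = self.entries.popitem(last=False)  # drop the least recently used
            evicted.append(old_id)
        self.entries[conversation_id] = num_tokens
        return evicted


cache = ToyKVCache(capacity_tokens=1000)
cache.store("a", 500)
cache.store("b", 400)
cache.lookup("a")                    # "a" is now the most recently used
evicted = cache.store("c", 300)      # does not fit, so the stalest conversation goes
print(evicted)                       # ['b']
print(cache.lookup("b"))             # 0, so "b" would need its prompt reprocessed
```

The idle conversation is the one that loses its cache, which is the same thing that happens to a chat you walked away from.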
So if you had a conversation an hour ago, in the cloud, it's doubtful any of its cached entries still exist. If you'd gotten up to 500k tokens, you're going through step #1 again for all of them; if you're going turn by turn immediately, only the new message needs processing before you move on to #2.
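A back-of-the-envelope sketch of that difference, assuming made-up throughput numbers (5,000 tokens/s for prompt processing and 50 tokens/s for generation are placeholders, not measurements of any real service). `turn_latency_seconds` is a hypothetical helper, not part of any library.

```python
def turn_latency_seconds(context_tokens, cached_tokens, new_prompt_tokens,
                         output_tokens, prefill_tps, decode_tps):
    """Estimate one turn: reprocess whatever is not cached, then generate the reply."""
    still_cached = min(cached_tokens, context_tokens)
    to_prefill = (context_tokens - still_cached) + new_prompt_tokens
    return to_prefill / prefill_tps + output_tokens / decode_tps


# Assumed numbers, for illustration only.
common = dict(context_tokens=500_000, new_prompt_tokens=200, output_tokens=500,
              prefill_tps=5_000, decode_tps=50)

cold = turn_latency_seconds(cached_tokens=0, **common)        # cache was evicted
warm = turn_latency_seconds(cached_tokens=500_000, **common)  # cache still resident

print(round(cold, 2), round(warm, 2))
```

With these placeholder rates the cold turn takes roughly 110 seconds against roughly 10 for the warm one, and nearly all of the gap is step #1 being redone for the old context.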
So some of the reports in March about the whole token gen allowance suddenly disappearing within hours were likely a KV cache/billing issue: they were charging you as if all those tokens were being processed from scratch on every back and forth. Whether it was a bug in billing or a bug in the caching code, who knows.
The trouble is that the traditional web server proxy caching and load balancing tricks that helped scale the web don't work here! Your conversation with 100k context has to return to the same cluster, maybe even the same GPU, to benefit from the extraordinarily fast KV cache reuse.
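Here's a rough sketch of why the routing matters, comparing plain round-robin with routing pinned to a conversation id. The worker names and `pick_worker` are hypothetical, and hashing the id is only one simple way to get stickiness; I'm not saying this is how any provider does it.

```python
import hashlib
import itertools

workers = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]  # hypothetical worker names


def pick_worker(conversation_id, workers):
    """Sticky routing: the same conversation id always maps to the same worker."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


# Round-robin, the classic stateless web trick, spreads turns across workers,
# so each turn would likely land on a GPU that has no KV cache for the conversation.
round_robin = itertools.cycle(workers)
rr_turns = [next(round_robin) for _ in range(3)]

# Sticky routing sends every turn of "conv-123" to one worker, where the cache lives.
sticky_turns = [pick_worker("conv-123", workers) for _ in range(3)]

print(rr_turns)
print(len(set(sticky_turns)))  # 1
```

One caveat with this naive modulo approach: adding or removing a worker reshuffles most conversations onto different workers, which throws away their caches.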