> Place rich UI elements within tables, lists, or other markdown elements when appropriate.
Which is to say, you wouldn't want to bake such a thing too deeply into a multi-terabyte pile of floating-point weights, because it makes operating things harder.
These are NOT included in the model context size for pricing.
Does inference need to process this whole thing from scratch at the start of every chat?
Or is there some way to cache the state of the LLM after processing this prompt, before the first user token is received, so that every request starts from this cached state?
https://platform.openai.com/docs/guides/prompt-caching
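For what it's worth, OpenAI's version is automatic once the prompt is long enough; you don't opt in, you just see the cached token count in the usage block. A rough sketch of checking that, assuming a recent `openai` Python SDK that exposes `usage.prompt_tokens_details` (the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a helpful assistant. " * 500  # placeholder: any long, static system prompt

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    # cached_tokens reports how much of the prompt was served from the prefix cache
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {resp.usage.prompt_tokens}, cached: {cached}")
    return resp.choices[0].message.content

ask("What is a KV cache?")     # first call: expect cached ~0
ask("What is a prefix tree?")  # same system-prompt prefix: expect cached > 0
```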
It's fairly simple actually. Each machine stores the KV cache in blocks of 128 tokens.
Those blocks are stored in a prefix-tree-like structure, probably with some sort of LRU eviction policy.
If you ask a machine to generate, it does so starting from the longest matching sequence in the cache.
They route between racks using a hash of the prefix.
Therefore the system prompt, being frequently used and at the beginning of the context, will always be in the prefix cache.
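Roughly, in toy form (illustrative only, not any provider's actual code; the 128-token block size follows the description above, and the prefix tree is approximated here with prefix-hash keys):

```python
from collections import OrderedDict
from hashlib import blake2b

BLOCK = 128  # tokens per cached block

def block_key(tokens: list[int], n_blocks: int) -> str:
    """Key for the first n_blocks blocks: a hash of that whole token prefix,
    so a block is only reused when everything before it matches too."""
    prefix = tokens[: n_blocks * BLOCK]
    return blake2b(str(prefix).encode(), digest_size=16).hexdigest()

class PrefixKVCache:
    def __init__(self, capacity_blocks: int = 1024):
        self.capacity = capacity_blocks
        self.blocks: OrderedDict[str, object] = OrderedDict()  # key -> KV state (stubbed)

    def longest_prefix(self, tokens: list[int]):
        """Return (n_cached_tokens, kv_state) for the longest cached prefix."""
        for n in range(len(tokens) // BLOCK, 0, -1):
            key = block_key(tokens, n)
            if key in self.blocks:
                self.blocks.move_to_end(key)          # mark as recently used
                return n * BLOCK, self.blocks[key]
        return 0, None                                # cache miss: prefill from scratch

    def insert(self, tokens: list[int], kv_state: object):
        """Store KV state for every complete 128-token block of this prefix."""
        for n in range(1, len(tokens) // BLOCK + 1):
            key = block_key(tokens, n)
            self.blocks[key] = kv_state               # in reality, only that block's KV slice
            self.blocks.move_to_end(key)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)       # LRU eviction

def route_to_rack(tokens: list[int], n_racks: int) -> int:
    """Route on a hash of the first block, so requests sharing a system prompt
    land on machines that already hold its KV blocks."""
    return int(block_key(tokens, 1), 16) % n_racks
```

With that picture, the last point falls out naturally: the system prompt occupies the first blocks of nearly every request, so its entries are touched constantly and never age out of the LRU.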
TZubiri•6mo ago
And their work is literally "DON'T do this, DO that in these situations"
TZubiri•5mo ago