ContextCache compiles tool schemas into a KV cache once and reuses it across all requests. Only the user query goes through prefill.
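The caching pattern can be sketched in a few lines. This is a toy illustration of the idea only, not the real implementation: `prefill` stands in for the expensive transformer pass over the tool schemas (which would really return KV tensors), and the dict stands in for the compiled cache.

```python
import hashlib

PREFILL_CALLS = 0  # counts how often the expensive prefix pass runs

def prefill(prefix: str) -> dict:
    """Stand-in for the transformer prefill pass over the tool schemas.
    In the real system this would return the KV cache tensors."""
    global PREFILL_CALLS
    PREFILL_CALLS += 1
    return {"prefix_tokens": prefix.split()}  # placeholder for KV tensors

_kv_cache: dict[str, dict] = {}

def cached_prefill(prefix: str) -> dict:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _kv_cache:           # compile once...
        _kv_cache[key] = prefill(prefix)
    return _kv_cache[key]              # ...reuse on every request

def answer(prefix: str, query: str) -> int:
    kv = cached_prefill(prefix)
    # Only the query tokens are processed per request.
    return len(kv["prefix_tokens"]) + len(query.split())

schemas = "tool: get_weather(city) tool: get_time(tz)"
answer(schemas, "weather in Paris?")
answer(schemas, "time in Tokyo?")
assert PREFILL_CALLS == 1  # schema prefix processed exactly once
```

The second request skips the schema pass entirely, which is where the prefill savings come from.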
Results (Qwen3-8B, RTX 3090 Ti):

- 50 tools: 5,625ms → 193ms (29.2x speedup)
- Zero quality degradation (TSA 0.850 matches full prefill exactly)
Also includes a CPU-only orchestrator (no GPU needed) using llama.cpp + Qwen3.5-2B that routes queries to the right tool in ~550ms. Works with any LLM backend — Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, or self-hosted.
Two products from one project:

- Route-only (~500ms): tool detection only, no LLM needed
- Full pipeline (~3s): route → extract params → execute → synthesize
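The full pipeline's four stages can be sketched end to end. Every function here is a stub so the control flow is visible: in the real project, route, extract, and synthesize are LLM calls and execute runs the actual tool; the tool names and canned results below are illustrative.

```python
# Toy sketch of the full pipeline: route -> extract params -> execute ->
# synthesize. All stages are stand-ins for the real LLM/tool calls.

def route(query: str) -> str:
    return "get_weather" if "weather" in query.lower() else "web_search"

def extract_params(query: str, tool: str) -> dict:
    # Stand-in for LLM parameter extraction.
    if tool == "get_weather":
        return {"city": query.split()[-1].strip("?")}
    return {"q": query}

def execute(tool: str, params: dict) -> str:
    fake_results = {"get_weather": "18C and cloudy", "web_search": "3 hits"}
    return fake_results[tool]

def synthesize(query: str, result: str) -> str:
    # Stand-in for the final LLM answer written over the tool result.
    return f"Answer: {result}"

def pipeline(query: str) -> str:
    tool = route(query)
    params = extract_params(query, tool)
    result = execute(tool, params)
    return synthesize(query, result)

assert pipeline("weather in Paris?") == "Answer: 18C and cloudy"
```

Route-only mode stops after the first stage, which is why it needs no LLM at all.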
Open source (CC BY 4.0), paper included.