Why we do this: latency. A 3B-parameter model, especially when quantized, can deliver sub-100ms time-to-first-token and generate a complete function call in under 100ms. That makes the LLM effectively “disappear” as a bottleneck: the only real waiting time is the external tool or API being called, plus the time it takes to synthesize a human-readable response.
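
To see whether your setup actually hits those numbers, you can time the two pieces separately: time-to-first-token and total generation time. The sketch below does this against a locally served quantized model through an OpenAI-compatible streaming endpoint (llama.cpp's server, vLLM, and Ollama all expose one); the base URL `http://localhost:8080/v1` and the model name `local-3b` are placeholders for your own deployment, not anything from the text above.

```python
import time
from openai import OpenAI

# Point the client at a local OpenAI-compatible server (placeholder URL/model).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunks = []

# Stream the completion so the arrival of the first chunk marks time-to-first-token.
stream = client.chat.completions.create(
    model="local-3b",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()
    delta = chunk.choices[0].delta.content
    if delta:
        chunks.append(delta)
end = time.perf_counter()

print(f"time to first token: {(first_token_at - start) * 1000:.1f} ms")
print(f"full generation:     {(end - start) * 1000:.1f} ms")
print("output:", "".join(chunks))
```

If the full-generation number stays well under the latency of the downstream tool or API call, the model has indeed disappeared from the critical path and further optimization effort belongs on the tool side.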