The input side is manageable — you can count tokens before sending. But output tokens are essentially unknowable upfront, and with agents that chain multiple calls (tool use, multi-turn reasoning, retries on failure), a single user action might be 3 API calls or 40. Multiply that by prompt caching behavior (which is great when it hits, but you can't always guarantee it), and the cost variance per task can easily be 10-20x.
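To make the variance concrete, here's a back-of-envelope sketch. All rates, token counts, and the cache discount are illustrative assumptions I made up, not real pricing:

```python
# Illustrative only: why per-task cost variance easily hits 10-20x.
# Rates, token counts, and cache discount are assumed, not real pricing.

IN_RATE = 3.00 / 1_000_000    # $/input token (assumed)
OUT_RATE = 15.00 / 1_000_000  # $/output token (assumed)
CACHE_DISCOUNT = 0.1          # cached input billed at 10% of full rate (assumed)

def task_cost(calls, in_tokens, out_tokens, cache_hit_rate):
    """Cost of one user action that fans out into `calls` API calls."""
    in_cost = calls * in_tokens * IN_RATE * (
        cache_hit_rate * CACHE_DISCOUNT + (1 - cache_hit_rate))
    out_cost = calls * out_tokens * OUT_RATE
    return in_cost + out_cost

# Best case: the action resolves in 3 calls.
cheap = task_cost(calls=3, in_tokens=8_000, out_tokens=300, cache_hit_rate=0.5)
# Worst case: tool loops and retries balloon it to 40 calls.
pricey = task_cost(calls=40, in_tokens=8_000, out_tokens=400, cache_hit_rate=0.5)

print(f"cheap ~ ${cheap:.3f}, expensive ~ ${pricey:.2f}, "
      f"ratio ~ {pricey / cheap:.0f}x")
```

Same feature, same user, roughly an order of magnitude apart in cost, and nothing in the request tells you upfront which path you'll get.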
This makes it hard to do basic things like set pricing for an AI-powered feature, decide whether an approach is even economically viable before building it, or give finance any kind of credible forecast.
What I've tried/looked at so far:
- Anthropic's token counting endpoint gives you exact input token counts pre-flight, which helps, but doesn't solve the output/chaining problem
- Logging everything post-hoc and building up averages per workflow — works but you're already committed by that point
- Setting hard spend caps at the API level — blunt instrument, doesn't help with per-feature attribution
- Looked at various OSS tools (ccusage, Langfuse, Helicone) — mostly retrospective dashboards, good for "what did I already spend?" but not "what will I spend?"
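One middle ground I've been sketching: combine the exact pre-flight input count with Monte Carlo sampling from your own historical per-workflow distributions, so the retrospective logs feed a forward-looking estimate. This is a hypothetical sketch, not a feature of any of the tools above; the function name, numbers, and distributions are all mine:

```python
# Hypothetical sketch: turn post-hoc logs into a pre-flight estimate by
# sampling from per-workflow historical distributions of call counts
# and output tokens. All names and numbers here are assumptions.
import random

def estimate_task_cost(call_counts, out_tokens_per_call, in_tokens,
                       in_rate, out_rate, n_samples=10_000, seed=0):
    """Return (p50, p95) dollar cost for one task.

    call_counts / out_tokens_per_call are empirical samples pulled from
    your own logs; in_tokens is the exact pre-flight count for this request.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        calls = rng.choice(call_counts)
        cost = sum(
            in_tokens * in_rate + rng.choice(out_tokens_per_call) * out_rate
            for _ in range(calls)
        )
        samples.append(cost)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.95)]

# Toy historical data: most tasks take 3-6 calls, a tail takes 30+.
p50, p95 = estimate_task_cost(
    call_counts=[3, 4, 4, 5, 6, 6, 8, 12, 30, 40],
    out_tokens_per_call=[200, 300, 400, 600, 1200],
    in_tokens=8_000,
    in_rate=3e-6, out_rate=15e-6,
)
print(f"p50 ~ ${p50:.3f}, p95 ~ ${p95:.2f}")
```

It doesn't solve the problem so much as reframe it: instead of a point estimate you get a band, and you price or gate the feature against the p95, not the mean.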
How are you handling this, especially if you're running agent-heavy workloads or building products where AI cost is a meaningful part of COGS? Are you doing any kind of pre-flight estimation? Cost-aware routing between models? Or just building first and optimizing later?