The provider returned 429s for a short period. We had per-call retry limits in place. We did NOT have containment at the request-chain level.
Because calls were nested and spread across multiple workers, each layer retried independently: one upstream 429 fanned out into a multiplicative storm of downstream retries that we didn't anticipate.
Per-call limits were not enough.
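For context, what I mean by chain-level containment is roughly this (a minimal sketch, all names hypothetical, not what we run): a single retry budget created at the chain's entry point and passed through every nested call, so retries anywhere in the chain draw from one shared pool instead of multiplying per call.

```python
import threading


class ChainRetryBudget:
    """Shared retry budget for one request chain.

    Created once at the chain entry point and threaded through
    nested calls, so the whole chain gets N retries total,
    regardless of how deeply calls are nested.
    """

    def __init__(self, max_retries: int = 5):
        self._remaining = max_retries
        self._lock = threading.Lock()  # the chain may span worker threads

    def try_consume(self) -> bool:
        """Return True if a retry is still allowed, else False."""
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return True
            return False


def call_with_budget(fn, budget: ChainRetryBudget):
    """Call fn(), retrying only while the chain-wide budget allows."""
    while True:
        try:
            return fn()
        except Exception:
            if not budget.try_consume():
                raise  # budget exhausted: fail the whole chain fast
```

The point is that per-call limits compose multiplicatively across nesting levels, while one shared budget caps the total work the chain can generate.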
For those running LLM systems in production:
– Do you implement chain-level retry budgets?
– Shared circuit breaker state?
– Per-minute cost ceilings?
– Cost-based limits (tokens/$) rather than retry count?
– Or is exponential backoff usually sufficient in practice?
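To make the cost-ceiling question concrete, this is the kind of thing I'm imagining (a rough sketch, names made up, not something we run): a trailing-window dollar ceiling checked before each provider call, so retries are bounded by spend rather than by attempt count.

```python
import time
from collections import deque


class CostCeiling:
    """Trailing-window spend ceiling (hypothetical sketch).

    Records the cost of each completed call; rejects new calls once
    spend in the trailing window reaches the ceiling. Retries are
    then bounded by dollars, not by a per-call retry counter.
    """

    def __init__(self, dollars_per_minute: float,
                 window_s: float = 60.0, clock=time.monotonic):
        self.ceiling = dollars_per_minute
        self.window_s = window_s
        self.clock = clock  # injectable for testing
        self._events = deque()  # (timestamp, cost) pairs
        self._total = 0.0

    def _evict(self, now: float) -> None:
        # Drop spend that has aged out of the trailing window.
        while self._events and now - self._events[0][0] > self.window_s:
            _, cost = self._events.popleft()
            self._total -= cost

    def allow(self) -> bool:
        """Check before issuing a call (or a retry)."""
        self._evict(self.clock())
        return self._total < self.ceiling

    def record(self, cost: float) -> None:
        """Record the cost of a completed call."""
        now = self.clock()
        self._evict(now)
        self._events.append((now, cost))
        self._total += cost
```

One nice property of a spend-based limit is that it is shared state by construction: expensive retries anywhere in the system eat the same budget, so a retry storm self-throttles even when no single call looks abnormal.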
I’m trying to understand what actually works at scale, beyond monitoring dashboards that tell you after the fact.