Common failure modes I'm thinking about:
- 429 rate limits → agents retry → hammer API worse
- Partial outages → synchronized retries across customers
- LangGraph workflows fail mid-execution → how to resume?
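
To make the first two failure modes concrete, here is a minimal sketch of the usual client-side mitigation: exponential backoff with full jitter that also respects a server-supplied retry hint. It is only an illustration, not something taken from LangGraph or any SDK. `RateLimited`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the `base`/`cap` values are arbitrary assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response (not a real SDK type)."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        # Seconds to wait, if the server sent a Retry-After hint.
        self.retry_after = retry_after


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimited with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            # Out of attempts: surface the error instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            delay = backoff_delay(attempt, rng=rng)
            # Never retry sooner than the server asked us to.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            sleep(delay)
```

The random jitter is what keeps many clients from retrying in lockstep after a partial outage. It does nothing to coordinate retries across processes or customers, which is the part I'm less sure how to handle.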

For those running agent systems at scale:
- How do you handle Layer 7 failures?
- Retry coordination? Circuit breakers?
- How do you prevent retry storms to downstream dependencies?
- Do LangGraph workflows gracefully handle API failures?

Curious what the production reality looks like.