Retry storms - API fails, your entire fleet retries independently, thundering herd makes it worse.
Partial outages - API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.
What I’m curious about: ∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?) ∙ How well does it work? What are the gaps? ∙ What scale are you at? (company size, # of instances, requests/sec)
I’d love to hear what’s working, what isn’t, and what you wish existed.