Retry storms - API fails, your entire fleet retries independently, thundering herd makes it worse.
Partial outages - API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.
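For context on the retry-storm case, the usual per-client mitigation is exponential backoff with jitter, so a fleet doesn't retry in lockstep after a failure. A minimal Go sketch; callAPI, the base delay, and the cap are placeholders, not recommendations:

    // Jittered exponential backoff: each client sleeps a random duration in
    // [0, backoff) before retrying, so independent retries spread out in time
    // instead of hammering the API at the same instant.
    package main

    import (
    	"errors"
    	"math/rand"
    	"time"
    )

    func callWithRetry(callAPI func() error, maxAttempts int) error {
    	backoff := 100 * time.Millisecond // illustrative base delay
    	maxBackoff := 10 * time.Second    // illustrative cap
    	for attempt := 0; attempt < maxAttempts; attempt++ {
    		if err := callAPI(); err == nil {
    			return nil
    		}
    		// Full jitter: pick a random wait up to the current backoff.
    		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
    		backoff *= 2
    		if backoff > maxBackoff {
    			backoff = maxBackoff
    		}
    	}
    	return errors.New("gave up after max attempts")
    }

Backoff alone doesn't solve the coordination problem across a whole fleet, which is part of why I'm asking.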
What I’m curious about:
∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
∙ How well does it work? What are the gaps?
∙ What scale are you at? (company size, # of instances, requests/sec)
I’d love to hear what’s working, what isn’t, and what you wish existed.
toast0•11h ago
If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.
You want to be able to accept and do work on requests that are likely to succeed within reasonable latency bounds, and drop the rest --- but be careful: an instant error can feed back into retry storms. Sometimes it's better if such errors come after a delay, so that the client is stuck waiting (back pressure).
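A minimal sketch of that idea, assuming a Go net/http service; maxInFlight and rejectDelay are illustrative placeholders, not recommendations:

    // Shed load once too many requests are in flight, and delay the rejection
    // slightly so impatient clients don't turn an instant error into a retry
    // storm (the delay is the back pressure).
    package main

    import (
    	"net/http"
    	"sync/atomic"
    	"time"
    )

    func shedLoad(next http.Handler, maxInFlight int64, rejectDelay time.Duration) http.Handler {
    	var inFlight int64
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
    			atomic.AddInt64(&inFlight, -1)
    			// Make the caller wait before it sees the error instead of
    			// handing it an instant 503 that it will immediately retry.
    			time.Sleep(rejectDelay)
    			http.Error(w, "overloaded, try later", http.StatusServiceUnavailable)
    			return
    		}
    		defer atomic.AddInt64(&inFlight, -1)
    		next.ServeHTTP(w, r)
    	})
    }

    // Usage: http.ListenAndServe(":8080", shedLoad(mux, 200, 500*time.Millisecond))

Note the sleep ties up a goroutine per rejected request, which is itself a crude form of back pressure; sending a Retry-After header is another option.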
rjpruitt16•2h ago