Retry storms - API fails, your entire fleet retries independently, thundering herd makes it worse.
Partial outages - API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.
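For context on the retry-storm case, the usual per-client mitigation is exponential backoff with jitter, so a fleet doesn't retry in lockstep after a failure. A minimal Go sketch; callAPI, the base delay, and the cap are placeholders, not recommendations:

    // Jittered exponential backoff: each client sleeps a random duration in
    // [0, backoff) before retrying, so independent retries spread out in time
    // instead of hammering the API at the same instant.
    package main

    import (
    	"errors"
    	"math/rand"
    	"time"
    )

    func callWithRetry(callAPI func() error, maxAttempts int) error {
    	backoff := 100 * time.Millisecond // illustrative base delay
    	maxBackoff := 10 * time.Second    // illustrative cap
    	for attempt := 0; attempt < maxAttempts; attempt++ {
    		if err := callAPI(); err == nil {
    			return nil
    		}
    		// Full jitter: pick a random wait up to the current backoff.
    		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
    		backoff *= 2
    		if backoff > maxBackoff {
    			backoff = maxBackoff
    		}
    	}
    	return errors.New("gave up after max attempts")
    }

Backoff alone doesn't solve the coordination problem across a whole fleet, which is part of why I'm asking.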
What I’m curious about:
∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
∙ How well does it work? What are the gaps?
∙ What scale are you at? (company size, # of instances, requests/sec)
I’d love to hear what’s working, what isn’t, and what you wish existed.
toast0•11h ago
If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.
You want to be able to accept and do work on requests that are likely to succeed within reasonable latency bounds, and drop the rest --- but be careful: an instant error can feed back into retry storms. Sometimes it's better if such errors come after a delay, so that the client is stuck waiting (back pressure).
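A minimal sketch of that idea, assuming a Go net/http service; maxInFlight and rejectDelay are illustrative placeholders, not recommendations:

    // Shed load once too many requests are in flight, and delay the rejection
    // slightly so impatient clients don't turn an instant error into a retry
    // storm (the delay is the back pressure).
    package main

    import (
    	"net/http"
    	"sync/atomic"
    	"time"
    )

    func shedLoad(next http.Handler, maxInFlight int64, rejectDelay time.Duration) http.Handler {
    	var inFlight int64
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
    			atomic.AddInt64(&inFlight, -1)
    			// Make the caller wait before it sees the error instead of
    			// handing it an instant 503 that it will immediately retry.
    			time.Sleep(rejectDelay)
    			http.Error(w, "overloaded, try later", http.StatusServiceUnavailable)
    			return
    		}
    		defer atomic.AddInt64(&inFlight, -1)
    		next.ServeHTTP(w, r)
    	})
    }

    // Usage: http.ListenAndServe(":8080", shedLoad(mux, 200, 500*time.Millisecond))

Note the sleep ties up a goroutine per rejected request, which is itself a crude form of back pressure; sending a Retry-After header is another option.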
rjpruitt16•2h ago