Ask HN: Is anyone losing sleep over retry storms or partial API outages?

2•rjpruitt16•13h ago
I’m working on infrastructure to solve retry storms and outages. Before I go further, I want to understand what people are actually doing today. I’d also like to compare solutions and maybe help someone see potential ones. The problems:

Retry storms - API fails, your entire fleet retries independently, thundering herd makes it worse.

Partial outages - API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.

What I’m curious about:
∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
∙ How well does it work? What are the gaps?
∙ What scale are you at? (company size, # of instances, requests/sec)

I’d love to hear what’s working, what isn’t, and what you wish existed.

Comments

toast0•11h ago
Retry storms are "easy": exponential backoff with jitter, like what Ethernet on shared media has been doing since the 80s.
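
For readers who haven't written one, here is a minimal sketch of backoff with jitter, assuming the "full jitter" variant (each delay is uniform between zero and an exponentially growing ceiling). The names `backoff_delays` and `call_with_retries` and the default `base`, `cap`, and `attempts` values are illustrative assumptions, not something from this thread.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield one "full jitter" delay per retry.

    Each delay is uniform in [0, min(cap, base * 2**n)) for retry n.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(fn, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call fn(); on exception, wait a jittered delay and try again.

    Makes one initial call plus at most `attempts` retries, then re-raises.
    """
    delays = backoff_delays(attempts=attempts, **backoff_kwargs)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The randomness is the point: without it, every client that failed at the same moment retries at the same moment too.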

If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.

You want to be able to accept and do work on requests that are likely to succeed within reasonable latency bounds, and drop the rest. Be careful, though: an instant error may feed back into retry storms, so sometimes it's better if such errors come after a delay, leaving the client stuck waiting (back pressure).
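
A rough sketch of the shedding half of that, assuming a single in-process queue with a fixed depth limit and a per-request deadline. The class name `SheddingQueue` and its methods are hypothetical, and the injected `clock` exists only to make it testable.

```python
import time
from collections import deque


class SheddingQueue:
    """Bounded work queue that sheds load instead of queueing without limit."""

    def __init__(self, max_depth, clock=time.monotonic):
        self.max_depth = max_depth
        self.clock = clock
        self._items = deque()

    def offer(self, request, timeout_s):
        """Admit the request if there is room; return False to shed it."""
        if len(self._items) >= self.max_depth:
            return False
        self._items.append((self.clock() + timeout_s, request))
        return True

    def take(self):
        """Return the next request that can still meet its deadline, or None."""
        while self._items:
            deadline, request = self._items.popleft()
            if self.clock() < deadline:
                return request
            # Deadline already passed: the caller has given up, so skip the work.
        return None
```

This sketch rejects immediately when the queue is full. To get the back-pressure effect described above, the server would hold the rejection for a short delay before answering, which is not shown here.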

rjpruitt16•2h ago
Agree backoff+jitter is table stakes, and load shedding/backpressure is necessary under sustained overload. The tricky cases I’m digging into are shared rate limits (429s) and many concurrent clients/agents where local backoff isn’t coordinated and you still get herds after partial outages. Curious what patterns you’ve seen work well for coordinating retries/fairness across tenants or API keys?
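
To make the uncoordinated case concrete, here is one possible sketch of the smallest form of coordination: workers in a single process that share an API key also share a "resume at" time, so a 429 seen by one pauses all of them. `SharedRetryGate` and its methods are hypothetical names. Coordinating across a fleet would need some shared store, which this sketch does not attempt.

```python
import threading
import time


class SharedRetryGate:
    """One gate per API key: a 429 seen by any worker pauses all of them."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._resume_at = 0.0

    def note_rate_limited(self, retry_after_s):
        """Record a 429; keep the latest resume time if several arrive."""
        with self._lock:
            self._resume_at = max(self._resume_at, self._clock() + retry_after_s)

    def wait_time(self):
        """Seconds a worker should wait before sending its next request."""
        with self._lock:
            return max(0.0, self._resume_at - self._clock())
```

Each worker would sleep for `wait_time()` (plus some jitter, so they don't all resume at once) before sending, and call `note_rate_limited()` with the server's Retry-After value when it gets a 429.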

HelloNurse•8h ago
A worrying choice of words.

"Losing sleep" implies an actual problem, which in turn implies that the mentioned mitigations and similar ones have not been applied (at least not properly) for dire reasons that are likely to be a more important problem than bad QoS.

"Infrastructure" implies an expectation that you deploy something external to the troubled application: there is a defective, presumably simplistic application architecture, and fixing it is not an option. This puts you in an awkward position: someone else is incompetent or unreasonable, but the responsibility for keeping their dumpster fire running falls on you.

rjpruitt16•2h ago
Fair pushback — to clarify, I’m not assuming incompetence or suggesting infra should paper over bad architecture.

By “losing sleep” I really mean on-call fatigue during partial outages — the class of incidents where backoff, shedding, and breakers exist, but retry amplification, shared rate limits, or degraded dependencies still cause noisy pages and prolonged recovery.

I’m trying to understand how teams coordinate retries and backpressure across many independent clients/services when refactors aren’t immediately available, not replace good architecture or take ownership of someone else’s system.

If you’ve seen patterns that consistently avoid that on-call pain at scale, I’d genuinely love to learn from them.
