Ask HN: How do you monitor and retry failed webhooks in production?

2•GoatPerfect•1h ago

I’ve been working on a project where webhooks are a core part of the system, and I realized how fragile they can be in practice.

Transient network errors, timeouts, downstream issues — things fail more often than expected.

I’m curious how others are handling this in production.

Are you building custom retry logic?

Using a queue?

Relying on provider retries?

Just logging and manually checking failures?

Do you monitor webhook delivery rates or alert on repeated failures?

Would love to hear what setups people are using and what’s worked (or not worked) for you.

Comments

toomuchtodo•1h ago

Have you checked out https://svix.com? No affiliation, I just like the product. Might also check out https://www.standardwebhooks.com/

GoatPerfect•32m ago

I just checked them out! Looks like it would make handling failures a breeze!

blundergoat•1h ago

We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.

A few rules that have saved us:

- Persist before responding. Never process inline. Write payload to DB, return 200 fast.

- Idempotency key required. Use either the provider event ID or a hash of the payload.

- Async worker processes from queue. Exponential backoff + max attempts.

- Dead letter queue + dashboard. Humans need visibility.

- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.

- Relying on provider retries alone has bitten us more than once.
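The first three rules above can be sketched roughly as follows. This is a minimal illustration, not blundergoat's actual setup: the `events` table, the function names (`open_store`, `ingest`, `backoff_delay`), and the backoff base and cap are all assumptions made for the example, and SQLite stands in for whatever database you really use.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open a store for raw webhook payloads (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " idempotency_key TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def idempotency_key(payload: bytes, event_id=None) -> str:
    # Prefer the provider's event ID; otherwise hash the raw body.
    return event_id if event_id else hashlib.sha256(payload).hexdigest()


def ingest(db, payload: bytes, event_id=None) -> bool:
    """Persist the payload before responding.

    Returns True if the event is new and False if it is a duplicate.
    Either way the HTTP handler can return 200 right after this call;
    the actual processing happens later in a worker.
    """
    key = idempotency_key(payload, event_id)
    cur = db.execute(
        "INSERT OR IGNORE INTO events (idempotency_key, payload) VALUES (?, ?)",
        (key, payload.decode("utf-8")),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows inserted means the key already existed


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped."""
    return min(cap, base * 2 ** attempt)
```

With these defaults the retry delays run 1s, 2s, 4s, 8s and so on up to the 300s cap. In practice you would probably add random jitter so that a burst of failures does not retry in lockstep.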

GoatPerfect•35m ago

Thank you so much for the tips! I was feeling nervous about relying on provider retries as well. I especially like the idea of alerting on backlog growth. There's nothing I hate more than a bunch of emails and notifications!

chickensong•9m ago

This was a nice goat exchange

JacobArthurs•37m ago

We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and DLQ alerts handle repeated failures automatically.

Idempotency becomes your responsibility, though, since messages can be delivered more than once.
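As a rough sketch of what owning the retry logic and idempotency can look like on the consumer side: the example below uses an in-memory `deque` in place of a real message queue, and the message shape (`id`, `body`, `attempts`), the `drain` function, and the `MAX_ATTEMPTS` value are all hypothetical. A real worker would also wait between attempts instead of retrying immediately.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; tune for your workload


def drain(queue, handler, processed, dead_letters, max_attempts=MAX_ATTEMPTS):
    """Process messages until the queue is empty.

    queue        -- deque of dicts with "id" and "body" keys
    handler      -- callable that raises on failure
    processed    -- set of message IDs already handled (dedupe store)
    dead_letters -- list collecting messages that exhausted their retries
    """
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed:
            continue  # duplicate delivery, already handled
        try:
            handler(msg["body"])
        except Exception as exc:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= max_attempts:
                # Out of retries: park it where a human can inspect it.
                dead_letters.append({**msg, "error": repr(exc)})
            else:
                queue.append(msg)  # requeue for another attempt
            continue
        processed.add(msg["id"])  # mark done only after success
```

Marking a message as processed only after the handler succeeds means a crash mid-handler leads to a retry rather than a lost event, which is why the handler itself still needs to tolerate being run twice.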
