How do you handle lost webhooks in production?

14•everydaydev•2mo ago

I've worked at several companies where we'd discover hours later that critical webhooks from Stripe/Shopify never arrived (deployment, timeout, bug, etc.).

Every team ended up building the same solution: retry logic, dead letter queue, monitoring.

Curious how others handle this:

- Do you rely on the provider's retry policy?
- Built your own reliability layer?
- Use a service?
- Just manually reconcile when it happens?

(Context: Building https://relaehook.com to solve this, but genuinely curious what the norm is)

Comments

samarthr1•2mo ago

Wait, so your product moves the point of failure from my infra to your infra?

Plus trusts y'all with contents of said webhook?

everydaydev•2mo ago

Fair question — we’re not eliminating failure so much as isolating it behind a system that’s purpose-built for durability. Our infra is built with redundant queues, retry pipelines, and observability you typically wouldn’t stand up for a single product team.

And on the data side, we don’t use webhook payloads for anything other than delivery. They’re encrypted at rest and in transit, and automatically purged based on retention settings.

super256•2mo ago

Ofc I rely on the retry policy. Stripe retries with exponential backoff for three days. If Stripe can't reach our endpoint in 3 days we probably went bankrupt or a solar flare ate IT.
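
For anyone who hasn't looked at what exponential backoff over a fixed window works out to, here is a rough Python sketch. The base delay, growth factor, per-delay cap, and jitter are all made-up illustrative values, not Stripe's actual schedule; only the three-day window comes from the comment above.

```python
import random

def backoff_schedule(base_s=60.0, factor=2.0, cap_s=6 * 3600.0,
                     window_s=3 * 24 * 3600.0, jitter=0.0, rng=random):
    """Return the delays (in seconds) between attempts that fit in the window.

    Every default here is an illustrative assumption, not a provider's real value.
    """
    delays = []
    elapsed = 0.0
    delay = base_s
    while True:
        # Optional jitter spreads retries out so they don't arrive in lockstep.
        d = delay * (1.0 + rng.uniform(-jitter, jitter))
        if elapsed + d > window_s:
            break  # the next attempt would fall outside the retry window
        delays.append(d)
        elapsed += d
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap_s)
    return delays

schedule = backoff_schedule()
print(len(schedule), "retries, last one at", sum(schedule) / 3600.0, "hours")
```

With these assumed numbers the delays double from one minute up to the six-hour cap and then repeat until the window runs out, so an endpoint has to be down for the whole window before an event is lost.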

everydaydev•2mo ago

Stripe does retries right, no argument there.

Where things get messy is when you have a mix of providers with wildly different retry behaviors, or internal services that have their own rate limits or downtime windows. A relay layer keeps the intake consistent even when the rest of the system isn’t.

nickphx•2mo ago

Yeaaaaaaaaaaaaah.. I am not sure adding an additional third party and point of potential failure would help mitigate the issue of receiving data from third parties... but good luck.

everydaydev•2mo ago

Fair point. The value isn’t in reducing the number of components, it’s in swapping a fragile one (your app endpoint) for something built specifically to stay up, queue, retry, and give you visibility when the rest of your stack isn’t. There are plenty of other products on the market that do something similar.

renewiltord•2mo ago

Yeah, common problem. But trivial to solve. Just have a minimal webhook server that records the full request and returns 200. Then process async.

Trivial Go program, day’s work. Stick it in Postgres, run continuously.
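
A rough sketch of the record-then-ack idea, in Python rather than Go for brevity, with stdlib `sqlite3` standing in for Postgres so it runs without dependencies. The table layout, the function names (`open_store`, `record`, `process_pending`, `serve`) and the `max_attempts` cutoff are all hypothetical choices for illustration, not a description of anyone's production setup.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

def open_store(path=":memory:"):
    # sqlite3 is a stand-in for Postgres here (assumption for the sketch).
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " headers TEXT NOT NULL, body BLOB NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def record(db, headers, body):
    """Persist the raw request untouched and return the new row id."""
    cur = db.execute(
        "INSERT INTO inbox (headers, body) VALUES (?, ?)",
        (json.dumps(headers), body),
    )
    db.commit()
    return cur.lastrowid

def make_handler(db):
    class Intake(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            record(db, dict(self.headers.items()), body)
            # Acknowledge only after the row is committed.
            self.send_response(200)
            self.end_headers()
    return Intake

def process_pending(db, handle, max_attempts=5):
    """One pass of the processing loop; returns how many rows succeeded."""
    rows = db.execute(
        "SELECT id, body FROM inbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, body in rows:
        try:
            handle(body)
            db.execute("UPDATE inbox SET status = 'done' WHERE id = ?", (row_id,))
            done += 1
        except Exception:
            # Keep the row for the next pass; park it as 'dead' after too many failures.
            db.execute(
                "UPDATE inbox SET attempts = attempts + 1,"
                " status = CASE WHEN attempts + 1 >= ? THEN 'dead' ELSE 'pending' END"
                " WHERE id = ?",
                (max_attempts, row_id),
            )
        db.commit()
    return done

def serve(db, port=8080):
    # Blocks forever; the processing loop would run as a separate process.
    HTTPServer(("", port), make_handler(db)).serve_forever()
```

The intake side does no parsing or validation, so a bug in business logic can't cause a non-200. In practice the processing loop would call `process_pending` on a timer against a file-backed or networked database with its own connection, and rows marked `dead` are what someone looks at by hand.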

Bizarrely there are vendors who are weird about webhooks. Lifefile, as an example, charges pharmacies a dollar per webhook firing. So the pharmacies are crappy about retry policy.

Tbh I wouldn’t buy any product in this space. It’s too simple: a dedicated HTTP server plus Postgres plus a processing loop. And with an already delicate thing I would rather not introduce more vendors.

No, not even if you converted it into event queue via websocket or zmq or what have you.

everydaydev•2mo ago

Your approach works, and lots of teams do exactly that. The tradeoff is that you’re now on the hook for uptime, retries, backpressure, tooling, on-call, metrics, etc.

Relae exists for teams who’d rather outsource that operational surface, similar to why people use managed queues instead of running their own RabbitMQ. Not everyone needs it — but some prefer not to own that part of the stack.

phillipseamore•2mo ago

svix.com

everydaydev•2mo ago

Svix is a solid managed webhook solution, and their platform is clearly geared toward enterprise teams. Smaller teams and startups need the same reliability patterns (durable delivery, retries, replay), but often at a lower price point. That’s where products like Relae aim to make sense: providing similar operational guarantees in a way that’s more accessible for non-enterprise use cases.

journal•2mo ago

anomaly detection, checks to make sure something is still happening.
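
One cheap version of the "is something still happening" check, as a Python sketch. It assumes you already track a last-received timestamp per provider; the names `is_stale` and `stale_sources` and the per-provider gap thresholds are hypothetical.

```python
import time

def is_stale(last_seen_ts, max_gap_s, now=None):
    """True when no webhook has arrived within the expected gap."""
    now = time.time() if now is None else now
    return (now - last_seen_ts) > max_gap_s

def stale_sources(last_seen, max_gap_s, now=None):
    """Return the providers that have gone quiet, sorted by name."""
    return sorted(
        name for name, ts in last_seen.items()
        if is_stale(ts, max_gap_s[name], now)
    )

# Example with made-up timestamps: stripe has been silent for 3500s, shopify for 500s.
quiet = stale_sources(
    {"stripe": 1000.0, "shopify": 4000.0},
    {"stripe": 900.0, "shopify": 900.0},
    now=4500.0,
)
print(quiet)
```

A fixed gap per provider is the crudest possible threshold; it only makes sense for sources that normally send traffic steadily.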
