Designing a Passively Safe API

https://www.danealbaugh.com/articles/passively-safe-apis

68•dalbaugh•1w ago

Comments

compressedgas•1w ago

> In APIs, passively safe means failures (crashes, timeouts, retries, partial outages) can't produce duplicate work, surprise side effects, or unrecoverable state.

I thought that was what 'idempotent' meant.

dalbaugh•1w ago

It's mostly semantics. Passive safety is the "why" while idempotency is the "how".

locknitpicker•1w ago

Idempotence is a trait of an operarion. Operations are idempotent, but systems were passive.

You don't have idempotent crashes.

omnicognate•1w ago

Idempotence of an operation means that if you perform it a second (or third, etc) time it won't do anything. The "action" all happens the first time and further goes at it do nothing. Eg. switching a light switch on could be seen as "idempotent" in a sense. You can press the bottom edge of the switch again but it's not going to click again and the light isn't going to become any more on.

The concept originates in maths, where it's functions that can be idempotent. The canonical example is projection operators: if you project a vector onto a subspace and then apply that same projection operator again you get the same vector again. In computing the term is sometimes used fairly loosely/analogistically like in the light switch example above. Sometimes, though, there is a mathematical function involved that is idempotent in the mathematical sense.

A form of idempotence is implied in "retries ... can't produce duplicate work" in the quote, but it isn't the whole story. Atomicity, for example, is also implied by the whole quote: the idea that an operation always either completes in its entirety or doesn't happen at all. That's independent of idempotence.

awildfivreld•1w ago

If anyone here wants to do this but don't want to implement all of this yourselves, this "field" is called Durable Execution. Frameworks such as Temporal, Restate and DBOS do a lot of the heavy lifting to get the idempotency, exactly once and recovery to a known state logic here.

fernandopj•1w ago

Second this. It's only been a few months since I started deploying Temporal at work, and there's no way that I would try implementing all this in-house.

fsociety•1w ago

Yes but one subtle point. Exactly once processing is not possible in these frameworks, they assume execution is idempotent which means at least once.

Now, any system I’ve seen designed around exactly once is complete garbage.

jedberg•5d ago

> Exactly once processing is not possible in these frameworks

Not entirely true, DBOS can guarantee exactly once execution if your API calls are idempotent.

vbezhenar•1w ago

That sounds like a lot of over engineering and a good way to never complete the project. Perfect is the enemy of good.

michalc•1w ago

Hmmm... depends on the project / phase of the project?

I am particularly not a fan of doing unnecessary work/over engineering, e.g. see https://charemza.name/blog/posts/agile/over-engineering/not-..., but even I think that sometimes things _are_ worth it

dxdm•1w ago

It sounds like a good way to make sure you don't overcharge your customers when handling such requests at scale. Failure and duplication will happen, and when serving enough requests will happen often enough to occupy engineering with investigation and resolution efforts forwarded from customer support.

Being prepared for these things to happen and having code in place to automatically prevent, recognize and resolve these errors will keep you, the customers and everyone in between sane and happy.

hmaxdml•1w ago

These are all important concerns, but I'd go for an off the shelf library that does it for me (disclaimer I work at https://github.com/dbos-inc)

vbezhenar•1w ago

You can just hire one person who will handle double charge issues and refund them when necessary. Might be much simpler and cheaper.

dxdm•5d ago

Pressing a refund button is not why engineering gets involved. It's the cleanup of the related data, metadata and automatically generated documents, because these are not consistent anymore. Of course you can automate that, or make at least create more buttons for non-engineering people to push, but then we're back to spending effort to anticipate these problems and enabling the system to prevent and/or handle them.

You also need to think about what it means to double-charge your customers, what it means to them and their wallets, and to their relationship to you. Do you want their repeat business? What sums are we talking about? How do you find out about these double-charges, and how quickly? Do the customers have to complain to you first, or did you anticipate the problem and have things in place to flag these charges?

Yes, you can hire people in place of the code you didn't write, but that only makes sense if continuing to pay them is cheaper than writing the code once and then maintaining it, which also probably means the manual work generated should not scale in proportion with your business.

Finally, developing for more than the happy-path is not overengineering, it's plain old engineering. There is a point, a kind and size of business, where it makes sense to do these things properly, and then TFA comes into play. The cost of just winging it goes up and up, until you need to do something about it.

locknitpicker•1w ago

> That sounds like a lot of over engineering and a good way to never complete the project. Perfect is the enemy of good.

Strong disagree. Addressing expectable failure modes is not over engineering. It's engineering. How do you put together a system without actually thinking through how it fails, and how to prevent those failure scenarios from taking down your whole operation?

vaylian•1w ago

> I'm in the process of migrating Augno's monolithic API to a microservices architecture.

Didn't we get to the point where we realized that microservices cause too much trouble down the road?

locknitpicker•1w ago

> Didn't we get to the point where we realized that microservices cause too much trouble down the road?

That's a largely ignorant opinion to have. Like any architecture, microservices have clear advantages and tradeoffs. It makes no sense to throw vague blanket statements at an architure style because you assume it "causes trouble", particularly when you know nothing about requirements or constraints and all architectures are far from bullet proof.

steve_adams_86•1w ago

For sure. There are some systems I would hate to build as a monolith and some systems I would hate not to. There's a good reason microservices showed up.

user3939382•1w ago

From TFA “Making some tasks asynchronous”

I have bad news for everyone. Nothing in computing is synchronous. Every instance we pretend it’s not and call it something else you have a potential failure under the right circumstances.

The more your design admits this the safer it will be. There are practical limits to this which you have to determine for yourself.

locknitpicker•1w ago

> I have bad news for everyone. Nothing in computing is synchronous.

I think you need to sit this one out. This sort of vacuous pedantry does no one any good, and ignores that it's perfectly fine to model and treat some calls are synchronous, including plain old HTTP ones. Just because everything is a state machine this does not mean you buy yourself anything of value by modeling everything as a state machine.

user3939382•1w ago

I think you should have sat it out actually. The “vacuous pedantry” is responsible for a huge class of bugs in computing. It’s juniors misunderstanding how these processes work or in any case developers not understanding where the wait is that cause all kinds of race conditions. So yes you absolutely buy yourself something by understanding and accounting for how the processes you’re wielding actually work.

zbentley•1w ago

Yeah no…many things are synchronous. I think you’re tilted about the fact that many things are “synchronous*” and that there can be important nuance hidden within the asterisk, but plenty of stuff is synchronous by default.

TCP connect(3) is synchronous. Making a directory on a local filesystem is synchronous. fsync(2) is synchronous. Committing an RDBMS transaction is synchronous.

user3939382•1w ago

“Yeah no” good way to start off. Not a single one of your examples is synchronous. In some practical sense, usually, so we call it that, but technically (the best kind of correct) they’re not, and it’s in these spaces precisely that bugs, attack vectors, undefined behavior, etc crop up. Computers are not synchronous architecturally. You issue a command, something else eventually sees that and processes it. Making a directory is a good example for clarity. That request is written to a buffer, cache, then eventually written to disk. It’s presented to you as synchronous for convenience but it’s not. So.. “yeah no” to your comment, you completely missed the point and truth in what I said.

zbentley•6d ago

I think you misunderstood the first half of my comment. Many of those things have an asterisk after “synchronous”, but in their default mode they often are in fact sync.

Saying “technically commands are posted and later observed, and that’s how we get security vulns” is an extraordinary claim. The vast, vast majority of program statements do not work thus. Memory operations immediately move instructions through the CPU to access data. Many IO operations request that the kernel immediately send an interrupt to a hardware device and wait for a response from that device to be written to memory—that’s synchronicity in the domain of electrical engineering, not just software. And sure, there’s periodicity and batching there (waiting for scheduler ticks and interrupt poll frequency and such), but none of that makes something less than synchronous; it just might slow it down. Unless you were referring only to timing attacks in your claim that security vulnerabilities result from not-really-synchronous actions, then I think that’s wrong twice over.

To expand on the examples: mkdir(2)’s durability (which is what we’re talking about when we refer to “synchronous-ness” of filesystem ops) depends—it depends on the filesystem and caching configuration of the system. But on many (I’d hazard most) file systems and configurations, new directories are persisted either immediately through the dentcache to the disk, or are during one of the next two calls to fsync(2), the next example I listed. And sure, there’s subtlety there! Your disk can lie, your RAID controller can lie and leave data in a write cache, exactly what is synchronous when you call fsync(2) depends on whether it’s the first or second call made in succession, and so on. But the fact remains that those calls do, in fact, block on the requested changes being made in many/most configurations. That’s far from your initial claim that “nothing is synchronous”.

Then consider the network examples. A connect(2) call isn’t like a socket send or filesystem write that might be cached or queued; that connect call blocks until a TCP negotiation with the target host’s network stack is performed. There’s an asterisk there as well (connection queues can make the call take longer, and we can have a semantic debate about atomicity vs synchronicity or “waiting for the thing to start” vs “waiting for the thing to finish” if you like), but the typical behavior of this action meets the bar for synchronicity as most people understand it.

The same is true further up the stack. A COMMIT RPC on most RDBMSes will indeed synchronously wait for the transaction’s changes to be persisted, or rolled back, to the database. The asterisks there are in the domain of two generals/client-server atomicity, or databases whose persistence settings don’t actually wait for things to be written to disk, but again: the majority of cases do, in fact, operate synchronously.

If “nothing is synchronous”, then how does read causality work? Like, my code can be relying on a hundred tiers of ephemeral and deceptive caches and speculatively executing out the ass, but a system call to read data from an IO source must necessarily be synchronous if I can observe that read when it finishes (either by seeing a read pointer advance on a file handle in the kernel, or just by using the data I read).

So … yeah, no. There’s nuance there, certainly. Deceptive and easy to mess up behavior, sure. But if that’s the complaint, say it—say “people do not understand the page cache and keep causing data loss because they assume write(2) works a certain way”. Say “people make wrong assumptions about the atomicity of synchronous operations”. Don’t say “nothing is synchronous”, because it isn’t true.

See https://rcrowley.org/2010/01/06/things-unix-can-do-atomicall..., https://datatracker.ietf.org/doc/html/rfc793.html#section-3...., and that amazing old … I think it was a JWZ article that compared data loss characteristics of Linux file systems given different power loss/fsync scenarios. Google isn’t helping me to find it, but perhaps someone who has it handy could link it here (unformatted HTML post that largely boiled down to “lots of drives and filesystems have durability issues; XFS and fsync are a good combination to achieve maximum durability”).

srinath693•1w ago

“Make the safe path the easiest path” is a great design principle. This should probably be a default mental model for public API design.

hmaxdml•1w ago

Durable execution has already been mentioned as the existing solution for this problem, but I would like to call out a specific pattern that DE makes obsolete: the outbox pattern. Imagine just being able to do do

send a() send b()

And know both will be sent at least once, without having to introduce an outbox and re-architect your code to use a message relay. We can nitpick the details, but being able to "just write normal code" and get strong guarantees is, imo, real progress.

zbentley•1w ago

To all the folks saying “durable execution frameworks solve this”—you’re right, but a lot of what’s described in the article isn’t quite the same as durable execution a la temporal. The approach described (transactional outboxes for side effectful operations, and care taken to be idempotent or resumable where possible, and to gracefully degrade, slow down, or rate limit where you can) achieves some of the same properties as a given durable execution framework, its true, but you don’t necessarily need to rewrite your code to be fully event sourced or use a framework to get a lot of those benefits, as the article demonstrates.

Transactional outboxes specifically are one of my favorite patterns: they’re not too hard to add and don’t require changing many core invariants of your system. If you already use some sort of message bus or queue, making publishes to it transactional under a given RDBMS is often as simple as adding some client side code and making sure that logical message deduplication and is present where appropriate: https://microservices.io/patterns/data/transactional-outbox....

If you use a separate message broker (Kafka, SQS, RabbitMQ) with this pattern, you’ll also need a sweeper cron job to re-dispatch failed publishes from the outbox table(s) as well.

Bonus points if this can be implemented on top of existing trigger-based audit table functionality.

ldng•1w ago

I find the emphasis on micro-service distracting.

I get that it is particularly valuable in that scenario by treating other services as "external API", but monolith also do call "external API" and delegate work to async tasks. The principles discussed here API are interesting beyond just micro-services while being lighter and simpler than Durable Execution.

Show HN: Medinilla – an OCPP compliant .NET back end (partially done)

How Does AI Distribute the Pie? Large Language Models and the Ultimatum Game

Resistance Infrastructure

Fire-juggling unicyclist caught performing on crossing

Restoring a lost 1981 Unix roguelike (protoHack) and preserving Hack 1.0.3

GPS and Time Dilation – Special and General Relativity

Show HN: Witnessd – Prove human authorship via hardware-bound jitter seals

Show HN: I built a clawdbot that texts like your crush

Scientists reverse Alzheimer's in mice and restore memory (2025)

Compiling Prolog to Forth [pdf]

Show HN: Cymatica – an experimental, meditative audiovisual app

GitBlack: Tracing America's Foundation

Horizon-LM: A RAM-Centric Architecture for LLM Training

We just ordered shawarma and fries from Cursor [video]

Correctio

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

Free Trial: AI Interviewer

FDA intends to take action against non-FDA-approved GLP-1 drugs

Supernote e-ink devices for writing like paper

We are QA Engineers now

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

Show HN: Medinilla – an OCPP compliant .NET back end (partially done)

How Does AI Distribute the Pie? Large Language Models and the Ultimatum Game

Resistance Infrastructure

Fire-juggling unicyclist caught performing on crossing

Restoring a lost 1981 Unix roguelike (protoHack) and preserving Hack 1.0.3

GPS and Time Dilation – Special and General Relativity

Show HN: Witnessd – Prove human authorship via hardware-bound jitter seals

Show HN: I built a clawdbot that texts like your crush

Scientists reverse Alzheimer's in mice and restore memory (2025)

Compiling Prolog to Forth [pdf]

Show HN: Cymatica – an experimental, meditative audiovisual app

GitBlack: Tracing America's Foundation

Horizon-LM: A RAM-Centric Architecture for LLM Training

We just ordered shawarma and fries from Cursor [video]

Correctio

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

Free Trial: AI Interviewer

FDA intends to take action against non-FDA-approved GLP-1 drugs

Supernote e-ink devices for writing like paper

We are QA Engineers now

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

Designing a Passively Safe API

Comments