I have yet to learn about this and will now be throwing some time into researching this topic.
Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there is a confluence of events that makes such a thing possible... For example, maybe the top-of-rack switch died. The server stays running, it just can't complete any upstream calls. Maybe it keeps retrying while the network is down (or slowly times out on individual requests and moves on to the next one). When the network comes back up, those calls start succeeding, but by now the work is far staler than you ever thought possible or planned for. That's just one scenario, and probably not exactly what happened to AWS.
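A purely illustrative sketch of that failure mode (the function names and durations are made up, not from the AWS report): a worker retries through the partition and then applies whatever it fetched, with nothing checking how stale the whole operation has become.

```python
import time

def run_cleanup(fetch_plan, apply_plan):
    # Hypothetical worker that normally finishes in about a minute.
    plan = None
    while plan is None:
        try:
            plan = fetch_plan()   # fails or times out while the switch is dead
        except ConnectionError:
            time.sleep(5)         # keep retrying; nothing bounds total elapsed time
    # The network is back and the call finally succeeded, but the work may now
    # be older than the 1-day tombstone window -- and nothing here notices.
    apply_plan(plan)
```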
QA people deal with problems and edge cases most devs will never encounter. They're your subject-matter experts on 'what can go wrong'.
Anyway, the point is: you can't trust that anything "will resolve in time period X" or that "if it takes longer than X, it has timed out". There are so many cases where this simply isn't true; it should be added to a "myths programmers believe" article if it isn't already there.
As is, this statement just means you can't trust anything. You still need to choose a time period at some point.
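One way to make that chosen period an enforced budget rather than an assumption (same made-up names as the sketch above, purely illustrative):

```python
import time

class TooStale(Exception):
    pass

def run_cleanup_guarded(fetch_plan, apply_plan, max_age_seconds=3600):
    # Same hypothetical worker, but it aborts loudly once the whole operation
    # has run longer than its staleness budget, instead of applying an
    # arbitrarily old result.
    started = time.monotonic()
    plan = None
    while plan is None:
        if time.monotonic() - started > max_age_seconds:
            raise TooStale("exceeded staleness budget while retrying; aborting")
        try:
            plan = fetch_plan()
        except ConnectionError:
            time.sleep(5)
    apply_plan(plan)
```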
My (pedantic) argument is that timestamps/dates/counters have a range based on the number of bits of storage they consume and the tick resolution. These ranges can be exceeded, and it's not reasonable for every piece of software in the chain to invent a new way of storing time, counters, etc.
I've seen my fair share of issues resulting from processes with uptimes over 1 year, and some over 5 years. Of course the wisdom there is just "don't do that, you should restart for maintenance at some point anyway", which is true, but it still means we are living with a system that will, in theory, break after a certain period of time, and we are sidestepping that by restarting the process for other reasons.
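Back-of-the-envelope, the wraparound point is just 2^bits ticks times the tick resolution (not tied to any particular system here):

```python
def wraparound_seconds(bits, tick_seconds):
    # Seconds until a counter of `bits` bits, incremented every `tick_seconds`,
    # wraps back to zero.
    return (2 ** bits) * tick_seconds

print(wraparound_seconds(32, 0.001) / 86_400)       # 32-bit millisecond tick: ~49.7 days
print(wraparound_seconds(31, 1) / 86_400 / 365.25)  # signed 32-bit Unix time: ~68 years (2038)
```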
Here we see the basic steps of modeling a complex system, and how that can be useful for understanding behavior even without knowing the details of every component.
> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied
If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.
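In a model checker, an invariant like that is just a predicate asserted after every step. A rough Python sketch of the idea (the state fields and the exact reading of the invariant are my guesses, not the article's actual model):

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    # Hypothetical state: the newest plan generation applied so far, and the
    # plan generation whose DNS record is currently live (0 = record deleted).
    newest_applied_plan: int = 0
    live_record_plan: int = 0

def invariant(s: ModelState) -> bool:
    # Once a newer plan has been applied, its DNS record must not have been
    # deleted out from under it.
    return s.newest_applied_plan == 0 or s.live_record_plan >= s.newest_applied_plan

def step(state: ModelState, action):
    # What a checker does after every transition: apply the action, then
    # assert the invariant still holds.
    new_state = action(state)
    assert invariant(new_state), f"invariant violated: {new_state}"
    return new_state
```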