I have yet to learn about this and will now be throwing some time into researching this topic.
Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there is a confluence of events that makes such a thing possible... For example, maybe the top-of-rack switch died. The server stays running, it just can't complete any upstream calls. Maybe it keeps retrying while the network is down (or slowly times out on individual requests and moves on to the next one). When the network comes back up, those calls start succeeding, but by now the work is far staler than you ever thought possible or planned for. That's just one scenario, and probably not exactly what happened to AWS.
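A purely illustrative sketch of that failure mode (the function names and durations are made up, not from the AWS report): a worker retries through the partition and then applies whatever it fetched, with nothing checking how stale the whole operation has become.

```python
import time

def run_cleanup(fetch_plan, apply_plan):
    # Hypothetical worker that normally finishes in about a minute.
    plan = None
    while plan is None:
        try:
            plan = fetch_plan()   # fails or times out while the switch is dead
        except ConnectionError:
            time.sleep(5)         # keep retrying; nothing bounds total elapsed time
    # The network is back and the call finally succeeded, but the work may now
    # be older than the 1-day tombstone window -- and nothing here notices.
    apply_plan(plan)
```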
QA people deal with problems and edge cases most devs will never encounter. They're your subject-matter experts on 'what can go wrong'.
Anyway, the point is: you can't trust that anything "will resolve in time period X" or that "if it takes longer than X, it has timed out". There are so many cases where this simply isn't true; it should be added to a "myths programmers believe" article if it isn't already there.
As is, this statement just means you can't trust anything. You still need to choose a time period at some point.
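One way to make that chosen period an enforced budget rather than an assumption (same made-up names as the sketch above, purely illustrative):

```python
import time

class TooStale(Exception):
    pass

def run_cleanup_guarded(fetch_plan, apply_plan, max_age_seconds=3600):
    # Same hypothetical worker, but it aborts loudly once the whole operation
    # has run longer than its staleness budget, instead of applying an
    # arbitrarily old result.
    started = time.monotonic()
    plan = None
    while plan is None:
        if time.monotonic() - started > max_age_seconds:
            raise TooStale("exceeded staleness budget while retrying; aborting")
        try:
            plan = fetch_plan()
        except ConnectionError:
            time.sleep(5)
    apply_plan(plan)
```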
My (pedantic) argument is that timestamps/dates/counters have a range based on the number of bits of storage they consume and the tick resolution. These ranges can be exceeded, and it's not reasonable for every piece of software in the chain to invent a new way of storing time, counters, etc.
I've seen my fair share of issues resulting from processes with uptimes over 1 year, and some over 5 years. Of course the wisdom there is just "don't do that, you should restart for maintenance at some point anyway", which is true, but it still means we are living with a system that will, in theory, break after a certain period of time, and we are sidestepping that by restarting the process for other reasons.
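Back-of-the-envelope, the wraparound point is just 2^bits ticks times the tick resolution (not tied to any particular system here):

```python
def wraparound_seconds(bits, tick_seconds):
    # Seconds until a counter of `bits` bits, incremented every `tick_seconds`,
    # wraps back to zero.
    return (2 ** bits) * tick_seconds

print(wraparound_seconds(32, 0.001) / 86_400)       # 32-bit millisecond tick: ~49.7 days
print(wraparound_seconds(31, 1) / 86_400 / 365.25)  # signed 32-bit Unix time: ~68 years (2038)
```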
Here we see the basic steps of modeling a complex system, and how that can be useful for understanding behavior even without knowing the details of every component.
> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied
If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.
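In a model checker, an invariant like that is just a predicate asserted after every step. A rough Python sketch of the idea (the state fields and the exact reading of the invariant are my guesses, not the article's actual model):

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    # Hypothetical state: the newest plan generation applied so far, and the
    # plan generation whose DNS record is currently live (0 = record deleted).
    newest_applied_plan: int = 0
    live_record_plan: int = 0

def invariant(s: ModelState) -> bool:
    # Once a newer plan has been applied, its DNS record must not have been
    # deleted out from under it.
    return s.newest_applied_plan == 0 or s.live_record_plan >= s.newest_applied_plan

def step(state: ModelState, action):
    # What a checker does after every transition: apply the action, then
    # assert the invariant still holds.
    new_state = action(state)
    assert invariant(new_state), f"invariant violated: {new_state}"
    return new_state
```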