The fact that there is not a single root cause but several makes me instinctively think this is a good report, because it's not what the "bosses" (and even less the politicians) like to hear.
ragebol•1h ago
Yep, sounds like "This was bound to happen at some point"
cucumber3732842•1h ago
Which on some level is exactly "what the bosses and politicians want to hear".
When it's everybody's fault it's nobody's fault.
drob518•58m ago
Exactly.
drob518•59m ago
Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.
amelius•53m ago
It usually starts with a broken coffee machine.
linuxguy2•27m ago
It's like the Swiss cheese model, where a system has several layers, every layer has "holes" or vulnerabilities, and a major incident only occurs when the holes line up through all the layers.
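To put rough numbers on that idea, here's a minimal sketch. It assumes the layers fail independently, which is optimistic since real failures are often correlated. The per-layer probabilities are made up purely for illustration and don't come from any report.

```python
from math import prod

def incident_probability(hole_probs):
    """Chance that every layer fails at once, assuming independent layers."""
    return prod(hole_probs)

# Hypothetical per-layer failure probabilities (illustrative only).
layers = [0.1, 0.05, 0.2, 0.1]

# All four protective layers in place.
all_layers = incident_probability(layers)

# Same system after one layer has been removed, e.g. by a recent change.
one_layer_removed = incident_probability(layers[:-1])

print(f"all layers:        {all_layers:.6f}")
print(f"one layer removed: {one_layer_removed:.6f}")
```

Under these assumed numbers, losing a single layer makes the aligned-holes case ten times more likely, even though none of the remaining layers got any worse.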
"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."
OgsyedIE•51m ago
There are ways to aggregate these into a single resilience score for policy makers with only moderate loss of detail but it's unpopular.
algoth1•54m ago
As someone who lived through the blackout, it was wild. It felt like falling back into the pre-internet, pre-smartphone era. It was pretty cool actually. Rumors spread so fast that within hours the official word on the street was that we were getting hacked by a foreign military, and people were joking that we had nothing worth conquering xD
pfortuny•36m ago
The hack thing spread wildly, indeed. Weird experience.
madaxe_again•30m ago
I didn’t even know about it until the next day - totally off grid, with Starlink for internet access - and no mobile signal where we live to give it away either.
singhrac•17m ago
I think people underestimate how valuable these reports are, so I’m very glad that a detailed investigation was done here. Every major grid operator around the world is going to study this and make improvements to make sure this doesn’t happen on their grid.
In a lot of ways it’s like investigations into airplane crashes.
jacquesm•16m ago
472 pages. That's going to be a nice bit of reading this weekend. It is very nice to see such a comprehensive report as well as the fact that it was made public immediately.
darkwater•1h ago
https://en.wikipedia.org/wiki/Swiss_cheese_model
Ringz•17m ago
anonymars•2m ago
https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...