I forget where I heard this, but: "you manage risk, but risk cannot be managed." I.e. there is no terminal state where risk has been "solved." It's much like a "culture of safety."
If the result/accident is too bad, though, you need to find all the different faults and mitigate as many as possible the first time.
Causal Analysis based on Systems Theory - my notes - https://github.com/joelparkerhenderson/causal-analysis-based...
The full handbook by Nancy G. Leveson at MIT is free here: http://sunnyday.mit.edu/CAST-Handbook.pdf
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was mostly about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
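(For what it's worth, the cheapest guard against that class of bug is failing fast on unknown flag names plus a launch-blocking smoke test that asserts the feature observably did something. A minimal Python sketch; FLAG_REGISTRY and run_tool here are hypothetical stand-ins for whatever the real system had:

    # Hypothetical flag registry: looking up a mistyped name raises
    # instead of silently leaving the feature off.
    FLAG_REGISTRY = {"enable_new_tool": False}

    def flag_enabled(name: str) -> bool:
        if name not in FLAG_REGISTRY:
            raise KeyError(f"unknown flag {name!r}; possible typo")
        return FLAG_REGISTRY[name]

    def run_tool() -> list[str]:
        # Stub standing in for the real tool's entry point.
        return ["did the thing"] if flag_enabled("enable_new_tool") else []

    def test_tool_actually_does_something():
        FLAG_REGISTRY["enable_new_tool"] = True
        assert run_tool(), "feature enabled but the tool produced nothing"

The point is that the test asserts an outcome, "the tool produced something," rather than just that nothing crashed.)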
I recall my manager handing the doc back to me and saying that I needed to completely redo it, because it was unacceptable for us to blame another team for our team's bug. That is how I learned that you can make a five whys process blame any team you find convenient by choosing the question. I quit not too long after that.
My first thought is: why is rolling out a new system to prod that is not used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand new service should not be on a tier where teething issues count as incidents.
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
I'd be interested to see the doc, but I imagine you'd branch off the causes. One branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? ... (you'd need that team's help).
I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!
So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot, though.
The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"
Whereas
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
Is more emotive. It sounds like a cross-examiner's question, which isn't the vibe you want to go for. 5 whys should be 5 drys. Nothing spicy!
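To illustrate, here's a drier version of a whole chain, using the mistyped-flag story upthread as a worked example (the answers are mine, purely to show the neutral tone):

    Why did the feature do nothing at launch? -> The flag name was mistyped.
    Why was the typo not caught before handover? -> Flag names are not validated anywhere.
    Why did the test effort not catch it? -> The test plan never asserted the tool produced output.
    Why did the test plan not assert that? -> The ticket described the implementation, not the required outcome.
    Why was the outcome missing from the ticket? -> There is no template asking for acceptance criteria.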
It took a while for all the teams to embrace the RCA process without fear and finger-pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best-viewed of our streams and processes :-)
Edit: You can see others commenting on precisely this. Examples:
https://news.ycombinator.com/item?id=45573027
https://news.ycombinator.com/item?id=45573101
Why did the testing team not catch that the feature was not functional?
This is covered by LINK
A few problems I faced:
- culturally, a lack of deeper understanding of or care around "safety" topics. The powers that be are inherently motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser-focused on the bare minimum of action items.
- security folks co-opting the safety work, because removing access to things can be misconstrued as making things safer. While somewhat true, it also makes doing jobs more difficult if the access is not replaced with adequate tooling. What this meant in practice was taking away access and replacing everything with "break glass" mechanisms. If your team is breaking glass every day and ticketing security, you're probably failing to address both security and safety.
- related to the last point, a lack of introspection into the means of making changes which led to the incident. For example: a user sshed in and ran the wrong command -> we should eliminate ssh. Rather than asking: why was ssh the best / only way the user could effect change on the system? Could we build an API for this, with tooling and safeguards, before cutting off ssh? (A sketch of what I mean is below.)
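To be concrete, a minimal sketch of that kind of API in Python. Everything here is made up (the operation names, the commands); the point is the allowlist plus the audit trail, not the specific calls:

    import logging
    import shlex
    import subprocess

    # Hypothetical allowlist of operations people previously ran over raw ssh.
    SAFE_OPS = {
        "restart-web": ["systemctl", "restart", "web.service"],
        "rotate-logs": ["logrotate", "--force", "/etc/logrotate.conf"],
    }

    def run_op(op: str, operator: str) -> None:
        """Run an allowlisted operation, leaving an audit trail."""
        if op not in SAFE_OPS:
            raise ValueError(f"{op!r} is not an approved operation")
        logging.info("operator=%s op=%s argv=%s",
                     operator, op, shlex.join(SAFE_OPS[op]))
        subprocess.run(SAFE_OPS[op], check=True)

Once most day-to-day changes go through something like this, "break glass" ssh access can actually mean break glass, instead of being the everyday path.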
Someone said the quiet part out loud!:
"""
Common circumstances missing from accident reports are:
Pressures to cut costs or work quicker,
Competing requests from colleagues,
Unnecessarily complicated systems,
Broken tools,
Biological needs (e.g. sleep or hunger),
Cumbersome enforced processes,
Fear of the consequences of doing something out of the ordinary, and
Shame of feeling in over one’s head.
"""
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broken, and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no getting out of that, but the analysis flow looks a bit different for things that should have worked versus things that could never have worked. And, yes, it will roll up on management, but... still.