
When I wrote up QTSP on How Complex Systems Fail two weeks ago, I forgot to include other interesting reviews of the paper.
The first, unsurprisingly, is from John Allspaw in 2009, before Allspaw coined “blameless postmortem”. Allspaw embraces the paper:
I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure.
It is interesting to see “early Allspaw’s” view on topics like the 5 Whys:
I believe that even a rudimentary process of “5 Whys” has value. (Update: I did when I first wrote this. Now, I do not.) But at the same time, I also think that there is something in the spirit of this paragraph, which is that there is a danger in standing behind a single underlying cause when there are systemic failures involved.
There are probably many worse ways to spend your time than walking parts of the “Allspaw trail”, even a decade removed.
Six years later, the don of paper blogging, Adrian Colyer of The Morning Paper fame, picks up the mantle:
This is a wonderfully short and easy to read paper looking at how complex systems fail – it’s written by a Doctor (MD) in the context of systems of patient care, but that makes it all the more fun to translate the lessons into complex IT systems, including their human operator components.
I think about Cook’s paper often. Recently I’ve been thinking about #18, failure-free operations require experience with failure. This is seemingly a paradox: we want to reduce failure, but reducing it requires experience with failure. Where does this experience come from once failure is reduced?
Some interesting answers might be learning-focused postmortems, where we can learn from failure indirectly, and chaos engineering experiments, where we can learn from failure in controlled conditions. The “resilience in software” community’s emphasis on these domains begins to come into focus…