
Recently I had beers with a friend and former coworker. As part of our catch-up, he heard a two-beer version of my months-long random walk through complex systems and resilience. I ranted about (er, explained) the importance of using postmortems to learn from failure in this setting, and was pleasantly surprised when he pinged me the next day to ask how he might improve his organization’s postmortems.
A slightly edited version of my email in response follows. It is essentially a sloppier, more opinionated, more concrete subset of the resilience-for-software README.
Improving postmortems to increase learning from failure
If you’re redesigning postmortems and haven’t read the Etsy guide, I would recommend it as a starting point. You don’t/shouldn’t cargo-cult all of it, but you (ed: the friend above) will note a strong contrast with your current process. https://how.complexsystems.fail is also a good thing to keep in mind.
Some quick hits:
- Root causes do not exist, only contributing factors
- No broken part (from Drift Into Failure by Dekker)
- Nonlinearity and control vs influence (also from Drift Into Failure)
- Human error is a symptom, not a cause (from The Field Guide to Understanding ‘Human Error’, also by Dekker)
- Use incidents to learn about the gap between “work as imagined” and “work as done”
- Be aware of the “dashboard trap” described in the Etsy guide
There are tons of resources here; further nodes to explore include:
- Anything by Allspaw, Cook, or Adaptive Capacity Labs
- Lorin Hochstein’s blog
Good luck and let me know if I can answer any followup questions!