Allspawn Pokemon: gotta catch ‘em all, video edition

John Allspaw is the godfather of resilience engineering in software, dating back to his introduction of the term and practice of “blameless postmortems” in a 2012 Etsy blog post. Allspaw is a prolific speaker, but there has never been a full timeline for the decades-long “incident” of his public speaking career… until now. I am likely missing some talks; please send me any additions or corrections. I’ve skipped videos under fifteen minutes in length and those behind paywalls.

  1. 2009-06-23 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr (w/ Paul Hammond) from Velocity 2009
  2. 2011-06-28 Building Resilience in Web Development and Operations from USI 2011
  3. 2011-11-09 Anticipation: What Could Possibly Go Wrong? from Velocity EU 2011
  4. 2011-11-16 Outages, Post Mortems, and Human Error 101 from Etsy Tech Talk
  5. 2012-04-23 Interview from GOTO Chicago 2013
  6. 2012-06-26 Stronger and Faster (w/ Steve Souders) from Velocity 2012
  7. 2012-09-25 Interview with Jez Humble
  8. 2013-05-14 Owning Attention: Alert Design Considerations from Etsy Tech Talk
  9. 2013-11-13 AMA from Velocity EU 2013
  10. 2013-11-22 Fireside Chat with Andrew Clay Shafer
  11. 2014-01-29 An Evening with John Allspaw on Development and Deployment at Etsy from Data Council
  12. 2014-06-24 Interview from Velocity 2014
  13. 2014-06-24 PostMortem Facilitation: Theory and Practice of “New View” Debriefings Parts One, Two, Three, Four from Velocity 2014
  14. 2015-05-28 Seeing the Invisible: Discovering Operations Expertise from Velocity 2015
  15. 2016-05-25 Common Ground and Coordination in Joint Activity from Papers We Love
  16. 2017-11-15 How Your Systems Keep Running Day After Day: Resilience Engineering as DevOps from DOES 2017
  17. 2018-03-20 Poised To Adapt: Continuous Delivery’s Relationship To Resilience Engineering from PipelineConf 2018
  18. 2018-04-24 Taking Human Performance Seriously in Software from DevOpsDays Seattle 2018
  19. 2018-08-16 In the Center of the Cyclone: Finding Sources of Resilience from REdeploy 2018
  20. 2018-09-12 Interview from PagerDuty Summit 2018
  21. 2018-09-12 Incidents as we Imagine Them Versus How They Actually Are from PagerDuty Summit 2018
  22. 2018-10-15 Problem Detection from Papers We Love
  23. 2019-02-11 Video AMA from PagerDuty 2019
  24. 2019-06-03 Taking Human Performance Seriously In Software from Monitorama PDX 2019
  25. 2019-07-08 Resilience Engineering: The What and How from DevOpsDays DC 2019

Bonus: podcasts

  1. 2016-02-13 PAPod 57 – System Reliability – John Allspaw from PreAccident Investigation Podcast
  2. 2017-03-07 John Allspaw on System Failures: Preventing, Responding, and Learning From Failure from SE-Radio
  3. 2018-09-05 096: Resilience Engineering with John Allspaw from Greater Than Code

Practical postmortem performance, personal prescription

Recently I had beers with a friend and former coworker. As part of our catchup, he heard the two-beer version of my months-long random walk through complex systems and resilience. I ranted about (er, explained) the importance of using postmortems to learn from failure, and was pleasantly surprised when he pinged me the next day to ask how he might improve his organization’s postmortems.

A slightly edited version of my email in response follows. It is essentially a sloppier, opinionated, concrete subset of the resilience-for-software README.

Improving postmortems to increase learning from failure

If you haven’t read it, I would recommend the Etsy guide as a starting point for redesigning postmortems. You don’t/shouldn’t cargo cult all of it, but you (ed: the friend above) will note a strong contrast with your current process. https://how.complexsystems.fail is a good thing to keep in mind.

Some quick hits:

  • Root causes do not exist, only contributing factors
  • No broken part (from Drift Into Failure by Dekker)
  • Nonlinearity and control vs influence (also from Drift Into Failure)
  • Human error is a symptom, not a cause (from The Field Guide to Understanding ‘Human Error’, also by Dekker)
  • Use incidents to learn about the gap between “work as imagined” vs “work as done”
  • Be aware of the “dashboard trap” described in the Etsy guide

There are tons of resources here; the books and guides mentioned above are good nodes from which to explore further.

Good luck, and let me know if I can answer any follow-up questions!

Quick thoughts on short papers: How Complex Systems Fail

How Complex Systems Fail by Richard Cook is one of my favorite papers. The fact that it was written in 1998, without a particular focus on software, makes it that much better. If you have experience in “operations” — aka “production” — this paper will make you feel seen.

This paper makes eighteen observations about complex systems and includes commentary on each. My favorites, grouped by theme (but seriously, this paper is short and amazing; you should read it!):

1. Complex systems are intrinsically hazardous systems.

2. Complex systems are heavily and successfully defended against failure.

3. Catastrophe requires multiple failures – single point failures are not enough.

4. Complex systems contain changing mixtures of failures latent within them.

5. Complex systems run in degraded mode.

6. Catastrophe is always just around the corner.

“Abandon all hope ye who enter here” — or, at least, the hope of a linear, understandable system which you can be confident will avoid catastrophe if only enough best practices are applied.

7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.

8. Hindsight biases post-accident assessments of human performance.

We want to believe that — and it is so much easier for us if — failure corresponds to a “broken part” somewhere in our system and we can be “done” once we fix it. But that’s flat-out not the case. There is no root cause. Similarly, since failures involve humans — either their action or inaction — it would be easy for us if “human error” were a cause, to be dealt with, perhaps, by mandating further training or imposing punishment. But it is, instead, a symptom of deeper problems in our complex socio-technical systems.

Sidney Dekker has written great books on these topics if you are interested in learning more.

10. All practitioner actions are gambles.

11. Actions at the sharp end resolve all ambiguity.

I wrote recently about the difference between work as imagined and work as done. We see here that work as done is the ground truth and is made up of operators making a sequence of choices, each of which they perceive — given their knowledge and constraints at the time — to be the best option available. In other words, a sequence of gambles!

16. Safety is a characteristic of systems and not of their components.

Every component may be working correctly and the system can still be broken :-(.
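
To make this concrete, here is a toy simulation (the component behaviors and numbers are hypothetical, invented for illustration, not taken from Cook’s paper): a load balancer and a pool of servers, each behaving exactly as specified, whose interaction turns one routine restart into a total outage.

```python
# Toy model: every component meets its local spec, yet the system fails.
#  - Servers shed load when pushed past capacity (correct behavior).
#  - The load balancer removes servers that shed load (correct behavior).
# One routine restart tips the system into a cascading failure.

SERVERS = 4
CAPACITY = 100   # requests/tick one server handles before shedding load
TRAFFIC = 360    # total requests/tick arriving at the load balancer

healthy = SERVERS - 1  # one server is briefly down for a routine restart
for tick in range(SERVERS + 1):
    if healthy == 0:
        print(f"tick {tick}: zero healthy servers -- total outage")
        break
    per_server = TRAFFIC / healthy
    if per_server <= CAPACITY:
        print(f"tick {tick}: {per_server:.0f} req/server, stable")
        break
    # The overloaded server sheds load (as designed), so the load
    # balancer marks it unhealthy (as designed) and removes it.
    healthy -= 1
    print(f"tick {tick}: {per_server:.0f} req/server > {CAPACITY}, "
          f"server removed -> {healthy} healthy")
```

No component here is a “broken part”; the failure lives entirely in the interaction, which is exactly the paper’s point.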

Lorin is blogging

In the last week, Lorin Hochstein has published five new posts on his blog. What makes this particularly exciting is that, as far as I know, Lorin is one of the few full-time resilience practitioners working in software. He is a member of Netflix’s Cloud Operations and Reliability Engineering (CORE) team. Surprisingly, not much has been written about CORE. Perhaps the most insightful text is their Senior Resilience Engineering Advocate job post:

The goal of the education and outreach function of the SRE team is to improve reliability and resilience of Netflix services by focusing on the people within the company, since it’s the normal, everyday work of Netflix employees that creates our availability. To cultivate operational excellence, we reveal risks, identify opportunities, and facilitate the transfer of skills and expertise of our staff by sharing experiences.

Following an operational surprise, we seek to understand what the world looked like from the perspective of the people involved. We facilitate interviews, analyze joint activity, and produce artifacts like written narrative documents. Relationship building is a huge part of this role. Someone advocating for resilience engineering within Netflix will help stakeholders realize when this type of work is most effective.


We Think About

Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is constantly undergoing change. Unforeseen interactions are common and operational surprises arise from perfect storms of events.

Surprises over incidents and recovery more than prevention. We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.

This is not the sort of language you typically see in the job description for an SRE! Lorin’s colleague Ryan Kitchens also gave a talk at SREcon19 that touches on CORE’s approach.

Back to those blog posts

Many of the topics Lorin has written on in this flurry of posts have been covered before by folks like Allspaw, Cook, Dekker, or Woods. Lorin’s writing is a great counterpoint to these voices and, I think, well suited to those new to resilience. He writes plainly and smoothly, and with the perspective of someone who has been in the trenches doing the work for some time.

Lorin’s recent blog posts cover a wide range of these topics.

I look forward to more posts from Lorin, even if he ends up slowing down a bit. Lorin is also not the only resilience practitioner blogging, nor do they all work at Netflix. I hope to highlight others’ great work in future posts.

Quick thoughts on short papers: “Failure to adapt or adaptations that fail: contrasting models on procedures and safety”

I’ve been consuming a lot of content in the last few months — books, papers, blog posts, talks, podcasts, you name it. I’ve had exciting aha! moments and spent hours trying to wrap my head around seemingly simple but, in fact, maddeningly complex concepts. In this setting, the right sort of short paper is like a refreshing cold drink on a hot, humid day. These papers argue a core idea of Goldilocks size and serve as landmarks and reference points as I sail through seas full of dragons.

The most recent paper of this type I have run into is Failure to adapt or adaptations that fail: contrasting models on procedures and safety by Sidney Dekker, from 2001 (hat tip to John Allspaw for passing it along). This paper has become my go-to for the idea that work as imagined and work as done are not the same. That is,

People at work must interpret procedures with respect to a collection of actions and circumstances that the procedures themselves can never fully specify (e.g. Suchman, 1987). In other words, procedures are not the work itself. Work, especially that in complex, dynamic workplaces, often requires subtle, local judgments with regard to timing of subtasks, relevance, importance, prioritization and so forth.

And:

Procedures are resources for action. Procedures do not specify all circumstances to which they apply. Procedures cannot dictate their own application. Procedures can, in themselves, not guarantee safety.

Applying procedures successfully across situations can be a substantive and skillful cognitive activity.

Safety results from people being skillful at judging when (and when not) and how to adapt procedures to local circumstances.

And, finally:

There is always a tension between centralized guidance and local practice. Sticking to procedures can lead to ineffective, unproductive or unsafe local actions, whereas adapting local practice in the face of pragmatic demands can miss global system goals and other constraints or vulnerabilities that operate on the situation in question. 

To apply this to software engineering, simply replace procedures or guidance with runbooks or documentation, think about the last time someone in your organization went “offroad”, and recall what the response was. The beauty of short papers is that they don’t need much more analysis than that. Just read it yourself — and let me know what you think!

What’s up funemployment

I left Lyft in June 2019. In the leadup to my departure, I decided to take a period of “structured funemployment,” rather than pipeline a job search and quickly get back to W-2 work. A break like this — where I set the agenda and have time to reflect on the past and future — is something I’d been promising myself I would take the next time I was between jobs. Achievement unlocked!

I’m privileged to have the financial/logistical space to take a break like this and especially so to structure this experience around my membership in the South Park Commons community, which I joined in July (h/t Matt for the referral). A group that “brings together talented people to share ideas, explore directions and realize the opportunities that’ll get you there” is exactly what I was looking for. The Commons is an inspiring place, full of brilliant, kind, hard-working people with varying backgrounds and goals. Spending my day at the Commons is energizing in a way that sitting at home is not.

In this setting, I’ve chosen to focus my attention so far on understanding how complex systems fail and the application of resilience engineering to software. This decision is motivated in large part by my experience with operations, failure, and the incident lifecycle during my time at Lyft. It turns out there are decades of research on why these problems are hard (i.e., why we can’t have nice things!).

I find this domain fascinating and am framing my pursuit as independent research, rather than an intent to launch a startup “in this space” (as they say). It’s… complex, but I don’t think this work aligns well with the sort of mechanics (e.g., metrics, growth) associated with success as a venture-backed company (but reach out if you think otherwise!). One data point here is that John Allspaw, the godfather of “resilience for tech” and former CTO of Etsy, is operating a consultancy, not a startup selling a product. My working hypothesis is that I’ll leave funemployment to work in this area inside a maturing hyper-growth company — think decacorns like Airbnb and Stripe. I still have at least a few months left of funemployment and research, but I’m happy to chat with folks at organizations looking to invest/hire in this area — drop me a note!  

So far, in practice, independent research on resilience engineering means consumption and conversation. Looking at the ever-increasing list of papers, books, and talks I have read/watched and have outstanding, it is definitely the best and worst of unbounded queue times. It is also great to have connected with individuals and communities — primarily on Twitter and in various Slacks — where there’s a healthy and ongoing discourse on resilience and related topics. Plenty of excellent chats over coffee as well — and funemployment means a flexible schedule, so if you’re interested in talking shop, just let me know.
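
For the non-queueing-nerds, here is a toy sketch of that joke (all numbers invented): when papers arrive faster than I can read them, the backlog, and the wait for whatever sits at the back of it, grows without bound.

```python
# Toy illustration of an unbounded queue (all numbers invented):
# items arrive faster than they are served, so the backlog -- and the
# wait time for the item at the back of the queue -- never stops growing.

ARRIVALS_PER_WEEK = 5  # new papers/books/talks discovered each week
READ_PER_WEEK = 3      # what I actually get through each week

backlog = 0
for week in range(1, 9):
    backlog += ARRIVALS_PER_WEEK - READ_PER_WEEK
    wait_weeks = backlog / READ_PER_WEEK  # time to reach the newest item
    print(f"week {week}: backlog={backlog:2d}, "
          f"newest item waits ~{wait_weeks:.1f} weeks")
```

The best part: the reading never runs out. The worst: anything I queue today may wait arbitrarily long.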

While I’d give myself top marks for participation on Slack and Twitter (@jhscott), I’ve been slower to produce longer-form analysis and writing. I’ve started drafting talk proposals for related conferences, the first being REdeploy, which I’ll be attending in San Francisco in October. I’m also hoping to use this blog to distill what I’ve learned (duh). In the next few days, look for a writeup of what SREs even do in a world of production ownership where developers hold the pagers.

Let’s chat

Curiosity and free time are a dangerous and excellent combination. If you’d like to grab coffee or lunch in the SFBA, or chat over Hangouts/Skype/etc., let me know. The best ways to reach me are email — first.last @ gmail.com — or Twitter — @jhscott.