In the last week, Lorin Hochstein has posted five new posts on his blog. What makes this particularly exciting is that, as far as I know, Lorin is one of the few full-time resilience practitioners working in software. He is a member of Netflix’s Cloud Operations and Reliability Engineering (CORE) team. Surprisingly, not much has been written about CORE. Perhaps the most insightful text is their Senior Resilience Engineering Advocate job post:
The goal of the education and outreach function of the SRE team is to improve reliability and resilience of Netflix services by focusing on the people within the company, since it’s the normal, everyday work of Netflix employees that creates our availability. To cultivate operational excellence, we reveal risks, identify opportunities, and facilitate the transfer of skills and expertise of our staff by sharing experiences.
Following an operational surprise, we seek to understand what the world looked like from the perspective of the people involved. We facilitate interviews, analyze joint activity, and produce artifacts like written narrative documents. Relationship building is a huge part of this role. Someone advocating for resilience engineering within Netflix will help stakeholders realize when this type of work is most effective.
We Think About
Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is constantly undergoing change. Unforseen interactions are common and operational surprises arise from perfect storms of events.
Surprises over incidents and recovery more than prevention.We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.
This is not the sort of language you typically see in the job description for an SRE! Lorin’s colleague Ryan Kitchens also gave a talk at SRECon19 that touches on CORE’s approach.
Back to those blog posts
Many of the topics Lorin has written on in this flurry of posts have been covered before by folks like Allspaw, Cook, Dekker, or Woods. Lorin’s writing is a great counterpoint to these voices and I think well suited for those new to resilience. He writes plainly and smoothly, and with the perspective of someone who has been in the trenches doing the work for some time.
Lorin’s recent blogposts cover:
- A proposal of a new “Gamma Knife” model of incidents to center the role of contributing factors
- His view that narrative is the most important part of postmortem documents (or, in CORE lingo, “operational surprise writeups”)
- His perspective on lack of control and hindsight bias in complex systems
- How to use incidents to reveal the context/tribal wisdom sitting in the heads of your organization’s experts.
- Why you might want to look forward instead of backward when reviewing incidents
I look forward to more blogs from Lorin, even if he ends up slowing down a bit. Lorin is also not the only resilience practitioner blogging nor do they all work at Netflix. I hope to highlight others’ great work in future blog posts.