Resilience Engineering

Jul 28, 2019

Edit: John Allspaw pointed out that this isn’t a post about the field of Resilience Engineering. I have added links at the bottom of the post to material about that field. I am also including here, at the top of the post, my working definitions and some context on the origins of my point of view.

Resilience: The capacity to prepare for disruptions, recover from shocks and stresses, and adapt and grow from a disruptive experience.

Engineering: The systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software.

Resilience Engineering: The design, implementation, testing, and documentation of software to prepare for disruptions, recover from shocks and stresses, and adapt and grow from a disruptive experience.

The train of thought that brought me to write this post was that we use team names in very odd ways in the software engineering space. We love code names, and we embrace obfuscation and jargon even while we rail against them. As I state in the post: “The naming of any team is fundamental to the team’s identity. It is an expression not only of the team’s culture, but also the functionality of the team.” This post is about defining a team identity and outlining the roles and responsibilities for that team within a larger software development organization.

I’ve been talking about this Resilience Engineering thing for a while now. I rebranded my company’s “Technical Operations”, “DevOps Team”, and “Release Engineering” teams as Resilience Engineering two years ago. The purpose was twofold. First, we were integrating two different organizations (just the latest in a long line of acquisitions of which I have been part), so we needed a common identity to move forward. Second, SRE has specific connotations, Operations has specific connotations, and neither of those underlying meanings drives towards a cultural change.

Don’t get me wrong, SRE is a hugely valuable function, and in fact, our team members are SREs. DevOps is a very important mindset and organizational shift, and is at the forefront of creating sustainable, resilient engineering practices within modern organizations. The issue here, though, is that SREs are individuals, and DevOps is cultural. What do we call the team that is made up of these individuals and promotes the DevOps cultural practice, and what function does it fulfill?

The naming of any team is fundamental to the team’s identity. It is an expression not only of the team’s culture, but also the functionality of the team. Team naming is a corollary of Conway’s law. The team name may initially reflect the team’s approach, but over time, the team’s approach will be a reflection of the identity. I’ve seen plenty of teams with cool codenames, and the organization has a hard time figuring out what those teams do. On top of that, so does the team.

For example, a team gets called Cheetah because it is super fast at producing new feature sets. The wider engineering organization doesn’t know exactly which part of the distributed system Cheetah works on, but it does know that the team works quickly. Over time, the team is given different work that needs to be done quickly, because it has that reputation. Given enough time, the Cheetah team turns into a hack team that will quickly produce a given feature for whatever part of the system it is asked to. By then, the team is geared only to work at breakneck speed, on whatever random piece of code is presented, without much care for the wider organization beyond the deliverable.

Liz Fong-Jones and Seth Vargo did a great series on DevOps vs SRE. Right in the first video they point out that “class SRE implements DevOps”, i.e. that SREs are the people who operate using the organizational tooling and approaches of DevOps. For more information, I suggest watching the full series.

Using SRE as the team name focusses the team on reliability. This is a realistic goal only up to a point. The Site Reliability Engineering team of any organization is seen as the team that keeps things up and running; they are responsible for the site being reliable. But the reason that most things fail in production is not a given machine dying. Rather, it is likely that code has been exercised beyond some limit. When that happens, the team responsible for site reliability is in no position to fix the problem, because the fault lies in the code itself rather than in the machines it runs on.

Resilience Engineering, however, has a focus on resiliency, i.e. self-healing, fault tolerance, and easy restoration of service. There’s a mandate in the name Resilience Engineering, and that mandate is to advise engineers on how to introduce techniques into their services that enable these three properties. Resilience Engineering as a team identity also creates a focus on the team itself being resilient. It is clear to members of Resilience Engineering teams that the system is not just the code and machines, but also the humans who operate the system. Additionally, Resilience Engineering is an open and collaborative group. It requires interaction between Dev, Ops, and Security. The group thinks about problems that might affect the system, not just problems that are affecting the system.
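To make the fault-tolerance idea a little more concrete, here is a minimal sketch of one such technique: retrying a failing call with exponential backoff, then degrading to a fallback value. This is an illustration only, not something taken from our codebase; the names call_with_retry, flaky_lookup, and the "cached value" fallback are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, then use `fallback`."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Back off exponentially between tries (0.1s, 0.2s, ...),
            # but do not wait after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of raising.
    return fallback()


# Hypothetical dependency that fails twice and then recovers.
calls = {"count": 0}


def flaky_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend unavailable")
    return "fresh value"


# The sleep is stubbed out so the example runs instantly.
result = call_with_retry(
    flaky_lookup,
    fallback=lambda: "cached value",
    sleep=lambda _seconds: None,
)
print(result)  # prints "fresh value" (the third attempt succeeds)
```

The fallback is the part I would draw attention to: the caller gets a degraded answer instead of an outage, which is the difference between a service that merely retries and one that tolerates the fault.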

If we do not build into our practice the ability to support the system easily (documentation, easy-to-use toolchains, good observability, and accurate alerting), then the humans operating it can easily burn out. And if the human part of the system fails, then the system itself fails. Therefore, SREs on a Resilience Engineering team are aware of the need for good knowledge sharing. They are proponents of good documentation and current run books. They are also aware that even when they are paged at 3am, putting together a good solution that improves the overall system is a priority.

Netflix has long championed Chaos Engineering. You can read more about it here. Resilience Engineering implements the tools and practices that Chaos Engineering develops, and promotes the design of software that can proactively adjust to the experiments and devastations that Chaos Engineering will cause. A good way of looking at this is that Chaos Engineers find the vulnerabilities and problems in a system before they show up in the wild, while Resilience Engineers design and educate so that these sorts of issues won’t show up at all. Combining the definitions of resilience and engineering is enough to understand the purview of this team: Resilience Engineering can be simply defined as the design, implementation, testing, and documentation of software to prepare for disruptions, recover from shocks and stresses, and adapt and grow from a disruptive experience.

It is important that Resilience Engineering works with Developers, Quality Engineers, Chaos Engineers, etc. There’s no clear delineation of responsibility here. We are all responsible for the resilience of the systems (even down to marketing, who can, for example, give a heads-up on the traffic forecast for a given campaign). And if it is this collaborative and requires this much involvement, why do we advocate for Resilience Engineering groups? Because Resilience Engineering groups, as stated above, lead the effort. We are the pointy end of the spear, the thin end of the wedge, and we have a responsibility to share and educate so that the entire system, human as well as machine, can adapt to adverse conditions.

For more information, please see below:

The Difference between Reliable and Resilient Software - CabForward
Resilience Testing - Usersnap has some great examples, including Netflix’s Simian Army
Software resilience engineering helps teams quash chaos - a brief introduction at TechTarget that includes the mapping to industrial engineering terms
Stanford’s 4th-year course Resilience Engineering - this is in CS410 at Stanford now
Ben Christensen’s excellent Velocity talk back in 2013 - things have moved on, but this is a great introduction and jumping-off point that shows how this was being thought about 6 years ago
Resilience Engineering as an IT Cultural Discipline - Cognizant whitepaper
Netflix Senior Software Engineer in Resilience Engineering job post - a good job description, including the focus areas

Edit: Added links for the field of Resilience Engineering

John Allspaw’s Twitter account
John’s blog post from 2011
Adaptive Capacity Labs

The Resilience Papers Club - for all things Resilience Engineering; I have a lot of reading to do
DevOps Days talk by John - an introduction
Four concepts for resilience and the implications for the future of resilience engineering - Woods DD
Resilience Engineering: Concepts and Precepts on Amazon - David Woods is also at Adaptive Capacity Labs
Erik Hollnagel’s home page - Erik Hollnagel is the co-author of Resilience Engineering: Concepts and Precepts

While I admit that a good part of my motive is to avoid anxiety, I hope that these edits bring a wider understanding of the field and its practical uses within software engineering.

Written by Caedman Oakley