I learned it all by sitting on the couch.
I’m sure there are millions of people out there who will tell you that running an Ironman changed their life. I’m not one of them. Well, I am, but not in that evangelical, gung-ho, must-recruit-you way. Running an Ironman absolutely changed my life in the way that I understand DevOps, Technical Debt and systems in general.
Running an Ironman is very similar to being at the front line of DevOps. There’s a culture that you have to adopt, a set of tools that you need, and everything but everything is governed by metrics. How far did you run today, how many calories did you take in and in what proportions, did you sleep and for how long, are you organized and ready to go for the next morning? Your culture is one of going it alone, yet being prepared and ready to help others in your world.
The tools we use are different, but they fall into the DevOps world. We all have monitors on our wrists; yes, we call them trackers, but it’s effectively statsd for humans. We log in to our dashboards every day to verify our systems are running how we expect. We have alerts set up to ensure that if things fall out of bounds we can address the issue; my equivalent of PagerDuty as a triathlete was my wife and children. “Daddy, you are getting grumpy, go run!” was heard more than once in our household.
We even have a uniform; trust me, slick body-hugging lycra is about as attractive as a faded, worn cat t-shirt with Cheeto stains under a baggy hoodie.
And sure, I can tell you all the things that I did in triathlon that equate to things that we do in the DevOps world. But that’s not the real lesson.
The real lesson is what I learnt when I STOPPED being a triathlete. When I achieved my goal of 99.999% Uptime. I was happy, I was at peak fitness: I weighed 160 lbs, ran 8-minute miles, and ran a half-marathon weekly just to get to work… I was unstoppable. So what?
Physically, when you reach a goal like an Ironman, there’s this sense of achievement, this sense of a job well done, and a euphoria. And for me, I need to rest. But if I rest too much I get into this post-race despondency, and I sit on the couch for 8 months drinking beer and eating pizza. The best practices of training daily, running to work and eating healthily go out the window. And this has taught me the value of our DevOps practices and tech best practices.
There are many habits and attitudes that we adopt in the pursuit of our goals. As a triathlete I kept logs of my calorie intake, my miles run, my rest days, my hydration and my sleep. I know that if I don’t do the training set out for me, I will let things get out of hand: my next run will be less productive, or worse, I won’t actually do the run, or swim, or ride, and I will slide into neglect.
We as DevOps practitioners know the value of observability, the on-call rotation, maintaining configs, continuous delivery, blameless post-mortems, code reviews, collaboration, education and a whole host of other things. We know the importance of continuous checking because of what we laughingly refer to as bit-rot. We know that if we leave things alone, we will forget how to maintain them, and that they will be used for purposes other than those for which they were intended.
We know that once we have set up our tools, we need to continually use them. We practice every day, and then once or twice a year we have a big practice that we refer to as DR or OpsGames or Incident Practice (I wish we had enough freedom from incidents to do practice!).
So what happens when we don’t do those things? What happens when we metaphorically go sit on the couch and drink beer for 6 months?
The first couple of weeks nothing changes. We relax and don’t worry about what’s happening, because it’s just a couple of weeks: a sprint without all that hard maintenance work. Yay! It’s like a vacation, except you’re working on fun stuff.
The next couple of weeks we kinda check in on the monitors and make sure they’re still there; we double-check a couple of PRs, but nothing major. It lulls us into a sense of security that what we have built is robust. We even let devs do things we wouldn’t normally allow, because it’s not a big risk and we can always go back to being hard taskmasters.
A month or so goes by. We’re now doing what we want to do: relaxing, chilling, writing some automation piece for some new tech that’s being built, but not really putting in the time on maintaining the existing stuff. Someone releases something, it breaks our “legacy” stuff, and we jump up and put out the fire. No biggie: all the other systems are fine, this one is problematic, and anyway the automated monitor caught it, so nothing to do, right? And on we go.
Six months later we have more incidents per week than we did before we sat on the couch. “Nothing’s changed” except that the little incremental changes across the stack that we haven’t thought about have bigger knock-on effects. Some of our best practices have fallen by the wayside because we don’t need to use them all the time. (Admins are now approving PRs as a matter of course, SREs aren’t being asked to look in on all deploys, QA isn’t checking every little thing because the automated tests are passing, but those tests haven’t been updated in six months, etc.)
It is clear that the effort to put in place the right practices takes time. It also decreases initial productivity for the team, but increases overall productivity and reduces maintenance costs long term. The adoption of these practices reduces incidents, allows for faster response times, better understanding of the systems and better overall system health. And bear in mind that system here doesn’t just mean the hardware, but also the people and the organization as a whole.
Knowing that, why is it that so many of us forget to keep working at this? Why do we assume that once we have the tools in place, they will keep on running and the systems won’t change?
An example: one organization’s tools department was asked to create a set of Jenkins jobs to create and destroy EMR clusters on the fly. This was one of those “drop everything and create it now” projects, but the group pushed back and got it properly defined and put in place. The team creating the jobs tested everything and made sure it was working. The team receiving the jobs also tested and ensured that they could create and drop clusters at will with their own configs, etc.
Everything was peachy keen and worked swimmingly. Of course there were a couple of iterations where the teams ironed out issues, but once everything was sorted, both teams were happy and it all worked well.
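For concreteness, here’s roughly the shape of what those on-demand jobs would have wrapped. This is a minimal sketch in Python with boto3, not the organization’s actual tooling; the cluster name, instance types, roles, region and log bucket are all placeholder assumptions.

```python
# Hypothetical sketch of the "create/destroy on demand" tooling; the names,
# instance types, roles and log bucket are placeholders, not the real config.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption


def create_cluster(name: str) -> str:
    """Spin up a small EMR cluster and return its cluster id."""
    response = emr.run_job_flow(
        Name=name,
        ReleaseLabel="emr-6.15.0",
        LogUri="s3://example-logs/emr/",  # placeholder bucket
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]


def destroy_cluster(cluster_id: str) -> None:
    """Tear the cluster down again: the half of the tooling that never got exercised."""
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```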
The delivery team was available to help fix infrastructure and jobs when needed. Over time there were fewer issues to fix. And then none. Which the delivery team took as a great sign: clearly the receiving team was on top of things and everything was now stable. Six months later, after some attrition, there was an incident. A new EMR cluster needed to be stood up stat!… Annnnnd everything failed.
During the post-mortem it became clear that the scripts, jobs and clusters that had originally been handed over had worked great. But even though the stated goal was “on demand” creation and destruction of these clusters, clusters had never been destroyed. New clusters had rarely been created, and after the attrition everyone just used the running cluster. The receiving team made changes that were local to that cluster. In fact, all changes were done locally, because knowledge had not been passed on during the attrition. Yes, documentation existed, but a Single Person of Failure meant that no one knew it was there.
The fact that no one then actually searched the documentation was really just a matter of prioritization: sure, they’d look it up when they needed to and when they could afford the time. But it was never a practice on the team to recheck, retest, or figure it out. When the incident came, it took 3 days of iterating back and forth to get the jobs and scripts back into working order. Yes, the system restoration itself took less than an hour, but the whole thing was 3 days.
What we could have done to avoid this is simple: measure the number of times a cluster is set up or torn down by the jobs, and put reminders in place if things get delayed or there isn’t at least a test run every week. This is simple monitoring: ensuring regular check-ins and SLO adherence, doing test cycles. All of this would have helped. It might not have avoided the incident, but it would have avoided wasting 3 days of engineering effort, roughly two mythical person-weeks.
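Here’s a minimal sketch of that monitoring idea: a scheduled check that counts how many clusters the jobs have actually created recently and nags if the answer is zero. The one-week threshold and the print-as-alert are assumptions; wire it into whatever scheduler and pager you already use.

```python
# Minimal sketch of the monitoring described above: run this on a weekly
# schedule and nag if the on-demand jobs haven't created a cluster recently.
from datetime import datetime, timedelta, timezone

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption


def clusters_created_since(days: int) -> int:
    """Count EMR clusters created in the last `days` days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    count = 0
    for page in emr.get_paginator("list_clusters").paginate(CreatedAfter=cutoff):
        count += len(page["Clusters"])
    return count


def weekly_check() -> None:
    """Scheduled check-in: remind the team if the tooling sat idle all week."""
    if clusters_created_since(7) == 0:
        # Placeholder alert; in practice this would page or open a ticket.
        print("WARNING: no EMR clusters created in 7 days; run a test cycle.")


if __name__ == "__main__":
    weekly_check()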
What happens now? Well, now I start trying to run and swim and bike again. My run times are around 10:30 per mile rather than 8:00, so I’ve got a struggle ahead of me. But I know what the work looks like and feels like. I know the results and I know how fit I can be. I know that once I am back to that level, I can keep going.
On the Resilience front, the path is equally arduous. Where we once had to figure out what needed to be done from scratch, we have to do the exact same thing again. We plan things out the same way: milestones, improvements. We put other things on hold, and we regenerate our tooling and our practices. Over time we get closer to where we were, but we effectively have to restart, and that means getting all that buy-in again from other teams. What’s worse, the SRE/Tooling/Resilience team, whatever you call it, will likely have to do this for other teams who don’t manage their own best practices.
So how do we solve this:
- Systems that support production must be monitored and measured just like production
- Everything that Resilience builds is a system that supports production
- Put in place a documentation and continuous testing practice (see the sketch after this list)
- During hiring and attrition ensure that team members are brought into this culture
- And if you think that this sounds like Kaizen for specific teams — then you are right.
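To tie those bullets back to the EMR story, here’s a sketch of what a continuous testing practice could look like there: a scheduled drill that exercises the whole create-and-destroy path end to end, so a red run is itself the alert. It assumes the hypothetical create_cluster/destroy_cluster helpers from the earlier sketch live in a module called emr_ondemand.

```python
# Sketch of a scheduled end-to-end drill: create a cluster, wait for it to come
# up, then tear it down, so the create/destroy path never silently rots and a
# failed run is itself the alert.
import boto3

from emr_ondemand import create_cluster, destroy_cluster  # hypothetical module

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption


def weekly_drill() -> None:
    """Exercise the full create-and-destroy cycle end to end."""
    cluster_id = create_cluster("weekly-drill")
    try:
        # Wait until the cluster is actually usable, not merely requested.
        emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    finally:
        # Exercise the teardown path too, even if startup failed.
        destroy_cluster(cluster_id)
        emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)


if __name__ == "__main__":
    weekly_drill()
```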