In an industry that places such emphasis on speed to market, Mean Time To Repair (MTTR) may be the most important metric you’re not paying attention to.
Mean Time To Repair (MTTR) can be defined as the average time that elapses between an incident occurring and the underlying issue being repaired.
“The most relevant metric in evaluating the effectiveness of emergency response” - Site Reliability Engineering - How Google Runs Production Systems, by Google
Waterfall Development and inherent risk
To understand the importance of this metric, we have to go back to a time when Waterfall development was the industry standard. In such an environment, development cycles upwards of a year were common, followed by almost as long spent in the test phase. Turnaround times on new features were constrained by a process that was set in stone. Silos were commonplace, with separate disciplines responsible for their own deliverables, which usually became someone else’s concern the moment they passed through a quality gate.
In such an environment, production deployments were a rare event and wide in scope. It wasn’t unusual for an entire application to be released in a ‘big-bang’ deployment. The stakes in this setting were understandably high. Through necessity, development teams tended to adopt a risk-averse approach. Deployments were carried out with the help of the Operations team in the middle of the night. I witnessed many such deployments, and they weren’t always successful, occasionally resulting in the dreaded roll-back.
With experiences like this, it’s a natural human response to associate deploying to production with risk. No-one wants to be responsible for breaking everything, which at the time felt like a realistic proposition.
“The real test is not whether you avoid this failure, because you won't. It's whether you let it harden or shame you into inaction, or whether you learn from it; whether you choose to persevere” - Barack Obama
Agile process and embracing risk
Times have moved on, thankfully. At Red Badger, teams deploy multiple times per day with the click of a button. This is usually the responsibility of an empowered QA Engineer or Developer with no sign-off process necessary.
So how have we come so far, and how have we overcome our fears of deploying to production? Spoiler - it’s not by being perfect.
A major factor has been a reduction in our MTTR from project to project. It’s difficult to say whether this is the cause or effect of our release cadence. Either way, I believe that focussing on reducing this metric can be a real enabler towards moving away from a risk-averse approach.
Reducing your MTTR
One way to reduce your MTTR is to focus on improving the observability of the application under development. By ensuring that an appropriate level of logging, tracing and metrics is added to the codebase, debugging any issues that may arise becomes a much easier task. I would encourage both Developer and QA involvement in the formulation of an effective monitoring and alerting strategy before any code is written.
The rise in DevOps over the past decade and, more recently, the advent of the Site Reliability Engineering (SRE) discipline place the responsibility for observability at the feet of developers, or those that we traditionally think of as developers. I would argue that this shouldn’t necessarily be the case. I see no reason why this area shouldn’t be shaped by discussions between Developers and QA Engineers if it leads to a shared understanding of the application and its internal health. Agile methodology promotes the idea of accountability amongst the team, so it seems outdated for a single discipline to be responsible for an entire area. I would take this further and advocate for QA Engineers to be included in any production support rotas, as a means of sharing the load, but also to gain a better understanding of system health beyond acceptance criteria and test cases. Ultimately, the more people who can gauge the health of your system, the better.
Once an observable platform is established, the team can start to put it to good use - gaining fast feedback from deployments and spotting any underlying issues as they manifest. Establishing baseline metrics, such as average request latency or error rate, can be extremely useful in spotting any degradations over time. Where real thought has been put into observability, and in particular alerting, it’s possible that issues could be identified and resolved before any of your users are even aware there’s a problem.
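As a rough illustration of comparing a deployment against a baseline, here is a minimal Python sketch. The sample numbers, the `is_degraded` helper and the 50% tolerance are all assumptions made up for this example; in practice the figures would come from whatever monitoring tool you use, and the threshold would be tuned to your own system.

```python
from statistics import mean


def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def is_degraded(baseline, current, tolerance=0.5):
    """True when the current value exceeds the baseline by more than
    the given tolerance (0.5 means 50% worse). Hypothetical rule."""
    return current > baseline * (1 + tolerance)


# Illustrative latency samples in milliseconds (invented data).
baseline_latency_ms = mean([120, 130, 125, 118, 127])
post_deploy_latency_ms = mean([190, 210, 205])

# Illustrative error counts per 10,000 requests (invented data).
baseline_error_rate = error_rate(12, 10_000)
post_deploy_error_rate = error_rate(15, 10_000)

latency_alert = is_degraded(baseline_latency_ms, post_deploy_latency_ms)
error_alert = is_degraded(baseline_error_rate, post_deploy_error_rate)

print(f"latency degraded: {latency_alert}")
print(f"error rate degraded: {error_alert}")
```

With these made-up numbers the latency check fires while the error rate stays within tolerance, which is the kind of signal that can prompt an investigation before users notice anything.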
The second way to improve your MTTR is to increase the frequency of releases to production. This is a challenging concept as it runs counter to our instincts - doing more of the thing that, historically speaking, we’ve grown to fear. This step works for two main reasons though. Firstly, by releasing more often, fewer lines of new code are deployed each time, which usually reduces the associated risk. Secondly, it becomes easier for a Developer or QA Engineer to unpick an unsuccessful release, as there are fewer lines of code to sift through to find an offending bug.
Earlier in this piece, I defined MTTR as an average across incidents. While that calculation will give you an overall figure, it is not always representative, in that it takes low-priority issues into account, some of which may have been deliberately left on the backlog. It may be that you are more concerned about incidents that result in any form of customer downtime, in which case a more useful measure would be the same average restricted to those incidents: the total time spent repairing downtime incidents, divided by the number of such incidents.
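To make the difference between the two figures concrete, here is a small Python sketch. It assumes a hypothetical `Incident` record with start and repair timestamps and a flag for customer downtime; the incidents themselves are invented, and this is one reading of the measure rather than a definitive formula.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record used only for this example."""
    started: datetime
    repaired: datetime
    caused_downtime: bool


def mean_time_to_repair(incidents):
    """Average of (repaired - started); None when there are no incidents."""
    if not incidents:
        return None
    total = sum((i.repaired - i.started for i in incidents), timedelta())
    return total / len(incidents)


# Invented sample data: two outages and one low-priority backlog item.
incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30), True),
    Incident(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 12, 9, 0), False),
    Incident(datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 15, 30), True),
]

overall = mean_time_to_repair(incidents)
downtime_only = mean_time_to_repair(
    [i for i in incidents if i.caused_downtime]
)

print(f"overall MTTR: {overall}")
print(f"downtime-only MTTR: {downtime_only}")
```

In this invented data set, a single backlog item that sat for ten days pulls the overall average to more than three days, while the downtime-only figure is one hour.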
What this number gives us is a quantifiable measure of risk mitigation. This is important when looking at the risk vs reward of a regular deployment. For example, would you be comfortable releasing to production daily if your MTTR was 10 days? Probably not. How about if this figure was under an hour? It depends on the business case, but at some point, the reward will far outweigh the risk when the risk is mitigated so effectively.
When we say that we embrace risk, it is with this in mind. It’s not that we choose to be reckless; it’s that we build risk mitigation into a repeatable, well-tested process. When this is the case, embracing risk becomes a more attractive proposition.
Releasing with regularity - a new normal
When releasing to production with regularity becomes the new normal, a new fast lane is created that never existed in the past. It’s a common misconception that compromises are made in speeding up the release process. We achieve this speed by releasing smaller amounts of fully tested code, with the added protection of feature flags where possible. In times of crisis, teams should be able to deploy through their regular release process without cutting corners, rather than throwing all process out the window to rush a hotfix.
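For readers unfamiliar with feature flags, the idea can be sketched in a few lines of Python. The flag name, the `is_enabled` helper and the discount logic are all hypothetical; real projects typically read flags from a configuration service or a dedicated flagging tool rather than a plain dictionary.

```python
def is_enabled(flag_name, flags):
    """Look up a flag, treating anything unknown as switched off."""
    return flags.get(flag_name, False)


def checkout_total(prices, flags):
    """Return the basket total, using the new code path only when the
    hypothetical 'new_discount_engine' flag is on."""
    total = sum(prices)
    if is_enabled("new_discount_engine", flags):
        # New behaviour: already deployed, but dark until the flag flips.
        total = total * 0.9
    return round(total, 2)


basket = [10.0, 20.0]

# The new code is in production, but users still see the old behaviour.
total_flag_off = checkout_total(basket, {})

# Flipping the flag releases the feature without another deployment.
total_flag_on = checkout_total(basket, {"new_discount_engine": True})

print(total_flag_off, total_flag_on)
```

Because the default is "off", new code can ship in small pieces and stay hidden, and switching the flag back is a quick way to mitigate an incident without a roll-back.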
MTTR can be an excellent gauge of operational efficiency and responsiveness, so should be tracked over time. A focus on reducing this figure can lead to a reduction in incidents as the release pipeline matures, while shortening the duration of any such incidents. The result of this is a streamlined process which provides a level of agility in product decisions that didn’t exist in the past. Want a new feature released to beat a competitor to market? The team can now do this without the need for a week-long regression testing period and an overnight deployment window. And will it be scary when it’s deployed? No, because most of it will already be in production. The last chunk of code will go in like the last piece in a jigsaw, like all the rest of the pieces that you’ve been laying down day in, day out.