Plutora Blog - Business Intelligence, DevOps, Digital Transformation, Release Management, Value Stream Management
Failure Metrics in Depth: MTTR vs. MTBF vs. MTTF
Today’s post features a detailed comparison of three DevOps metrics that are vital for enterprises and tech organizations in general: MTTR, MTBF, and MTTF. MTTR stands for “mean time to repair,” MTBF for “mean time between failures,” and MTTF for “mean time to failure.” They all sound very alike, and all three indicate a length of time. One of them has to do with reacting to a problem or issue, while the other two refer to the time that passes before such issues occur.
So how different, or not, are these metrics from each other? Is there any difference between the time “between failures” and the time “to failure”? What qualifies as a failure? These are the kinds of questions we’ll be answering today.
By the end of the post, you’ll have a better understanding of each one of these metrics, as well as of the importance of the metrics themselves. Let’s dig in.
Don’t worry—we’ll soon be covering the three metrics as promised in the post’s intro. However, it makes sense to first take a brief diversion to talk about a topic that’s important for most of the metrics we’ll discuss: failure. What is failure?
Maybe you think that failure is obvious, and we shouldn’t waste any time discussing such a basic concept. When something doesn’t work, it means that it’s failed. Right?
Here’s how Wikipedia defines failure:
Failure is the state or condition of not meeting a desirable or intended objective, and may be viewed as the opposite of success.
The definition above is short and straightforward, but it leaves room for interpretation and subjectivity. For instance, consider a movie that wasn’t a hit at the box office, despite being acclaimed by the critics and even winning some important awards. Was it a failure or not? What about a smartphone with a beautiful and intuitive UI and a vast offering of useful apps but laughable battery life? Has it failed?
You might think I’m pushing it a little bit with these examples that aren’t related to the world of software or corporate IT, but I ask you to bear with me.
For instance, imagine an application that, despite giving correct results, has terrible performance. Though other people may disagree, that’s a failure in my view. It doesn’t really matter that the underlying APIs work perfectly; if the user’s experience is catastrophically poor, the software has failed.
We could go further and claim that a faulty UX is also a failure. Especially when it comes to web applications, having a defective UI and a poor user experience overall will lead consumers to your competitors, resulting in bad word of mouth and loss of revenue.
However, as you know, this post is about metrics. That means we must find a way to put subjectivity aside and come up with a measurable definition of failure. When it comes to the technology world—and this is especially true when it comes to the hardware, infrastructure, and operations side of things—the most popular way to measure failure is through system outages or downtimes. Even though there are forms of failure that don’t necessarily result in full-blown system outage, downtime is a useful synonym for failure due to its objectivity. It’s binary—a system is either online or not—and measurable. As you’ll see, hours of downtime is the foundation for the more involved metrics covered in this article.
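To make the idea of measurable downtime concrete, here is a minimal Python sketch. The outage log, its dates, and the function name `downtime_hours` are invented for illustration; in practice the intervals would come from whatever monitoring tool you use.

```python
from datetime import datetime

# Hypothetical outage log: (start, end) of each period the system was offline.
outages = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 30)),
    (datetime(2024, 1, 17, 22, 15), datetime(2024, 1, 18, 1, 15)),
]


def downtime_hours(intervals):
    """Sum the length of every outage interval, in hours."""
    seconds = sum((end - start).total_seconds() for start, end in intervals)
    return seconds / 3600


total_downtime = downtime_hours(outages)
print(f"Failures: {len(outages)}, downtime: {total_downtime:.1f} hours")
```

Both numbers this produces, the count of failures and the hours of downtime, are the raw inputs for the metrics discussed in the rest of the post.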
The Importance of Metrics
You might have heard or read the saying “what can’t be measured can’t be improved” or some variation of it. Often attributed to management guru Peter Drucker, the saying succinctly points out that, in order to improve in whatever endeavor you’re taking part in, you need objective indicators that you can keep track of.
That’s true in a lot of fields, and IT is certainly no exception. Quite the opposite, in fact. Organizations around the world make heavy investments in their technology infrastructures. So not employing metrics to monitor and improve said infrastructure amounts to ignoring the return on that investment.
MTTR vs. MTBF vs. MTTF: Let’s Cover Each One in Detail
After talking about the importance of defining failure in an objective way and explaining why keeping track of metrics is so essential for tech organizations, it’s time to finally cover the metrics in the post’s title.
For each one of the metrics, you’ll learn
- its definition;
- why you should care about it; and
- how to calculate it.
MTTR (Mean Time to Repair)
According to Wikipedia, “[m]ean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device.”
The crucial word in the definition above is repairable. Items that are not susceptible to repairs don’t fall under the umbrella of MTTR.
MTTR reflects the time it takes an organization to react to unplanned incidents and put their gear, equipment, and devices back to work again. This metric calculates the time passed from the beginning of an incident until the moment it’s solved.
Why Should You Care About MTTR?
This metric is very valuable because tracking it tells you two things at once. First, as we’ve already mentioned, the metric indicates how fast your organization responds to problems thrown its way. So the smaller this number, the better for your company. Second, because MTTR covers the whole repair process, it also shows how much time each step of that process adds.
MTTR often includes the time to do the following:
- Notify the relevant repair workers.
- Diagnose the problem.
- Repair the problem.
- In the case of physical equipment, allow the equipment/devices to cool down.
- In the case of physical equipment, reassemble the device and make it ready for use again.
- Finally, set up and test the device.
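If you record how long each of these steps takes, you can see where repair time actually goes. The following is only a sketch: the phase names and the hours are made up, and it assumes every incident logs the same set of phases.

```python
from statistics import mean

# Hypothetical per-incident timings, in hours, for each repair phase.
incidents = [
    {"notify": 0.5, "diagnose": 2.0, "repair": 3.0, "test": 0.5},
    {"notify": 1.0, "diagnose": 4.0, "repair": 4.0, "test": 1.0},
]


def mean_hours_per_phase(records):
    """Average each phase across all incidents (assumes identical keys)."""
    phases = records[0].keys()
    return {phase: mean(r[phase] for r in records) for phase in phases}


breakdown = mean_hours_per_phase(incidents)
mttr = sum(breakdown.values())  # the phase averages add up to the overall MTTR

for phase, hours in breakdown.items():
    print(f"{phase}: {hours:.2f} h")
print(f"MTTR: {mttr:.2f} h")
```

With this sample data, diagnosis and repair dominate the total, which is the kind of signal that tells you where to invest first.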
The next section will cover how to calculate this metric.
How to Calculate MTTR
Calculating the mean time to repair is simple. First, you find out the total time spent on unplanned maintenance for a given asset (e.g., a specific device). Then you divide that number by the number of failures that happened with that equipment over a specified period of time.
Let’s consider an example.
Imagine your organization spent 64 hours repairing a given device. That device broke down eight times during the year. That means your MTTR is eight hours (64 ÷ 8).
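The same arithmetic as a short Python sketch. The function name is our own, and the guard against zero failures is an assumption about how you’d want to handle an asset that never broke down.

```python
def mean_time_to_repair(total_repair_hours, failure_count):
    """MTTR = total unplanned repair time / number of failures."""
    if failure_count <= 0:
        raise ValueError("MTTR is undefined without at least one failure")
    return total_repair_hours / failure_count


# The example from the text: 64 hours of repairs across eight breakdowns.
print(mean_time_to_repair(64, 8))  # 8.0
```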
What’s a desirable goal for your MTTR? That varies a lot, especially depending on the type and size of the organization. As a general rule, though, an MTTR value of five hours or less is considered a good goal.
MTBF (Mean Time Between Failures)
The next item on our list is MTBF, which stands for “mean time between failures.” That’s an interesting KPI because, like the previous one, it has to do with malfunctioning devices or assets. However, unlike the previous metric, MTBF is all about the devices themselves. While MTTR represents how quickly an organization can react when unexpected problems occur, MTBF indicates the level of quality and reliability of assets.
Why Should You Care About MTBF?
We’ve just defined MTBF. Let’s now cover the main motivations behind this metric. Why should you care about it?
For starters, you track MTBF so you can improve it. Unlike the previous metric, improving MTBF means raising it: the longer an asset runs between failures, the fewer incidents you have over the same operating time. MTBF measures reliability. So, by carefully monitoring this metric, you’ll be able to know how long you can expect a given asset to run before its next failure.
Once you have this knowledge, your organization can use it to make educated decisions on issues such as scheduling of maintenance, inventory management, and so on.
MTBF, along with other KPIs, can also help your organization evaluate its own monitoring capabilities. If your organization already adopts monitoring tools and mechanisms, keeping your MTBF high shouldn’t be so hard. If you’re actively tracking this metric and managing to keep it under control, that’s a sign your monitoring capabilities are probably healthy.
How to Calculate MTBF
Calculating MTBF is also simple. First, you find the total number of operational (online) hours for a given asset over a given period. Then you divide that number by the number of failures that happened over the same time.
So imagine that a given piece of equipment has been fully operational for 1,000 hours over a period of six months. Over the same period, that asset broke down five times. The MTBF for this piece of equipment, then, is 200 hours.
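Here is that calculation as a small Python sketch. The `AssetRecord` class and its field names are illustrative, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class AssetRecord:
    """Hypothetical summary of one asset over an observation period."""
    operational_hours: float  # hours the asset was actually up and running
    failures: int             # breakdowns during the same period

    def mtbf(self):
        """MTBF = operational hours / number of failures."""
        if self.failures <= 0:
            raise ValueError("MTBF is undefined without at least one failure")
        return self.operational_hours / self.failures


# The example from the text: 1,000 operational hours, five breakdowns.
equipment = AssetRecord(operational_hours=1000, failures=5)
print(equipment.mtbf())  # 200.0
```

Note that only operational hours go into the numerator. Time spent on repairs is excluded, which is what keeps MTBF separate from MTTR.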
MTTF (Mean Time to Failure)
MTTF stands for “mean time to failure.” In short, this metric refers to the average life span of a given asset.
You might think that MTTF sounds quite a lot like MTBF. And you’d be right, of course. In fact, they’re almost the same thing. The only major difference between them is that MTBF is usually reserved for repairable items. MTTF, on the other hand, is used in scenarios where fixing an item isn’t an option.
We can say that MTTF represents an expectation. It sets the amount of time you can expect a given asset to work reliably until it fails.
Why Should You Care About MTTF?
Since MTTF and MTBF are so closely related, it shouldn’t come as a surprise that their advantages/use cases are virtually the same.
Like MTBF, MTTF indicates reliability. By tracking this metric, you’ll be able to get an accurate estimate of how long a given item works until it breaks beyond hope of repair.
Also, as with MTBF, MTTF helps organizations make informed decisions about inventory management—even including decisions about which brands to buy—and more.
MTTF relies on another very important metric we’re not discussing today: MTTD. MTTD stands for “mean time to detect” and refers to the average time it takes your organization to become aware of incidents when they happen. Bringing and keeping MTTD low is the key to managing all of the other metrics.
How to Calculate MTTF
You calculate MTTF by taking the total number of operational hours and dividing them by the number of assets you’re monitoring. Let’s say we have four pieces of equipment we’re testing. The first one failed after ten hours, while the second one failed after twelve hours. The third failed after six hours, and finally, the last one failed at eight hours. So we have a total uptime of 36 hours, which divided by four equals nine hours. This result suggests this particular asset will need to be replaced, on average, every nine hours.
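In Python, the same example looks like the sketch below. The list simply holds the four lifetimes from the paragraph above, and the function name is our own.

```python
def mean_time_to_failure(lifetimes_hours):
    """MTTF = total operational hours / number of assets observed."""
    if not lifetimes_hours:
        raise ValueError("MTTF needs at least one observed asset")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Hours each of the four test units ran before failing for good.
lifetimes = [10, 12, 6, 8]
print(mean_time_to_failure(lifetimes))  # 9.0
```

Unlike the MTBF calculation, the divisor here is the number of assets rather than the number of failures, because each nonrepairable asset fails exactly once.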
MTTR vs. MTBF vs. MTTF: A Verdict?
If you take away a single thing from this post, let it be this: focus on MTTR more than on MTBF, or any other metric. Why?
In short, DevOps is about action. It’s not about measuring for the sake of measuring. For DevOps, metrics are only useful when they’re actionable—that is, when they help your organization make decisions and fix problems.
In that light, MTTR is a way more attractive metric. MTBF and MTTF, as you’ve seen, are more focused on figuring out the expected life span of assets. You could say that they’re more passive.
MTTR, on the other hand, is all about action. It’s an incentive for your organization to go out there and fix whatever’s wrong. That’s absolutely the attitude you want from an organization that performs DevOps.
Using Plutora Analytics to Visualize MTTR and Other Metrics
Now that you understand the similarities and differences between the metrics and you’re ready to start adopting them in practice, the next step for you is to consider tooling.
Plutora’s solution is a complete value stream management platform. Besides helping your organization manage remote teams and create a single source of truth to streamline your software delivery process, Plutora also shines when it comes to data visualization. By leveraging its Key DevOps Metrics, you can easily visualize not only the “mean time to recover” metric, but also the below:
- Deployment frequency: how often are you deploying to production?
- Change failure rate: how often does a deploy result in failure?
- Lead time for changes: how long does it take for a valuable change to reach the final user?
The following image can give you an idea of what those metrics actually look like:
Besides the built-in metrics, Plutora also offers a powerful visualization engine, allowing you to create as many custom value stream management reports as you need.
In this post, we’ve compared three important DevOps KPIs: MTTR (mean time to repair), MTBF (mean time between failures), and MTTF (mean time to failure). However, before going there, we took a step back to define failure and explain the necessity of metrics as a form of insulation against uncertainty and subjectivity.
So what about the metrics themselves?
MTTR is all about your organization: it indicates how promptly it reacts when unexpected problems happen. MTBF, on the other hand, is more about your assets. It indicates the expected time it takes for a given, repairable item to present issues. MTTF is closely related to MTBF, to the point of being mistaken for it or used interchangeably. The two metrics are, in fact, almost the same, with one major difference. While MTBF refers to repairable items, MTTF refers to the average life span of a nonrepairable asset. So you’ll use this metric for things that you can’t repair and have to replace instead.
As you’ve seen, the most valuable metric, from a DevOps point of view, is MTTR, the mean time to repair. While the other two metrics are certainly useful, they’re more contemplative in nature. MTTR, on the other hand, is all about action. Tracking it and working on improving it are incentives to bring down the time it takes your teams to put systems back online when things go sour.
Why are these metrics so valuable?
The answer is twofold. For starters, the metrics are valuable for obvious reasons. By keeping track of these indicators, we’re able to improve them. There’s an inevitable degree of subjectivity when it comes to defining failure and success, and IT isn’t immune to that. Metrics are insulation against said subjectivity.
Depending on how you choose to do it, that can cause a chain reaction that benefits the whole organization. For instance, as we’ve already mentioned, MTTD (mean time to detect, a metric we only touched on briefly) might be considered the base of all of the other metrics. That makes sense: if you want to react to problems quickly (MTTR) and lengthen the expected time between failures of your assets (MTBF), then being able to quickly detect incidents is of paramount importance.
But there’s another equally important but more subtle aspect. Just being able to track those metrics accurately and efficiently means your organization has great monitoring capabilities. In other words, KPIs often double as an indicator of the health of your organization’s monitoring capabilities.
Thanks for reading.