MTTD (Mean Time to Detect): Defined and Explained
Apr 5, 2023
You may be familiar with the maxim: "you can't improve what you don't measure." Even though there's criticism aimed at the quote, generally speaking, I believe it's great advice for organizations in the digital economy. That's why tracking and improving failure metrics is imperative for any enterprise that wants to thrive in the digital age, and in this post, we'll cover a crucial one: MTTD.
MTTD (mean time to detect) is a vital metric for organizations that want to improve their incident management strategies. We'll open the post by defining the metric. You'll understand what MTTD is, why you should care about it, and how to calculate it.
After that, we'll expand a little, covering how MTTD compares with other DevOps metrics and which best practices to adopt to keep MTTD down.
Let's get started.
What Is MTTD?
MTTD means "mean time to detect" or "mean time to discover." It refers to how long it takes, on average, for the organization to identify a production problem, such as a system outage. So, you naturally want to keep this metric as low as possible, since that would mean your organization is quick to discover issues and, consequently, to fix them.
Why Should You Track MTTD?
Why should an organization track its MTTD value? First, you want to fix production issues as quickly as possible, since every minute they persist costs the company money. The longer it takes to discover a problem, the longer it takes to fix it.
Also, having a low value for MTTD is an excellent sign for an organization's incident response strategy. It means that the organization has mechanisms and processes in place that allow it to find out about problems quickly. In other words, a low MTTD value is a sign of a healthy incident management response.
How to Calculate MTTD
As long as you have data available on incident detection, calculating MTTD is straightforward. You add all the detection times for incidents during a given period; then, you divide this total by the number of incidents.
For instance, consider the following list of incidents for a hypothetical company:
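| Date | Incident Start | Detection Time | Time to Detect (Minutes) |
| --- | --- | --- | --- |
| 2022-12-01 | 8:00 AM | 8:45 AM | 45 |
| 2022-12-03 | 11:27 PM | 11:47 PM | 20 |
| 2022-12-07 | 3:15 AM | 4:15 AM | 60 |
| 2022-12-18 | 4:06 PM | 5:16 PM | 70 |
| 2022-12-23 | 6:56 AM | 7:58 AM | 62 |
| Total | | | 257 |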
The table above shows that, during December 2022, the organization had five incidents, whose total detection time was 257 minutes. So, MTTD = 257 / 5. Thus, the mean time to detect was 51.4 minutes.
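The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the timestamps are the ones from the example table rewritten in 24-hour format, and the function names are made up for this post.

```python
from datetime import datetime

# (incident start, detection time) pairs from the December 2022 example,
# rewritten in 24-hour format.
INCIDENTS = [
    ("2022-12-01 08:00", "2022-12-01 08:45"),
    ("2022-12-03 23:27", "2022-12-03 23:47"),
    ("2022-12-07 03:15", "2022-12-07 04:15"),
    ("2022-12-18 16:06", "2022-12-18 17:16"),
    ("2022-12-23 06:56", "2022-12-23 07:58"),
]

FMT = "%Y-%m-%d %H:%M"


def minutes_to_detect(start: str, detected: str) -> float:
    """Minutes elapsed between the incident start and its detection."""
    delta = datetime.strptime(detected, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


def mean_time_to_detect(incidents) -> float:
    """Sum of all detection times divided by the number of incidents."""
    if not incidents:
        raise ValueError("MTTD is undefined when there are no incidents")
    times = [minutes_to_detect(start, detected) for start, detected in incidents]
    return sum(times) / len(times)


print(mean_time_to_detect(INCIDENTS))  # 51.4
```

In practice, the start and detection timestamps would come from your monitoring or ticketing system rather than a hard-coded list.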
So, as you can see, the arithmetic part of calculating MTTD is quite simple as long as you dutifully collect data on all of your incidents—and that's where the challenge often resides.
Also, remember that it's possible to have more complex scenarios than the ones above. For instance, the organization can choose to group incidents by their severity (e.g., low, medium, and high) and thus have more than one MTTD value.
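As a rough sketch of what grouping by severity could look like, the snippet below reuses the five detection times from the December example. The severity labels are invented for illustration; the example data doesn't include them.

```python
from collections import defaultdict
from statistics import mean

# (severity, minutes to detect). The minutes come from the December example;
# the severity labels are hypothetical.
DETECTIONS = [
    ("high", 45),
    ("low", 20),
    ("medium", 60),
    ("high", 70),
    ("medium", 62),
]


def mttd_by_severity(detections):
    """Return one MTTD value (in minutes) per severity level."""
    grouped = defaultdict(list)
    for severity, minutes in detections:
        grouped[severity].append(minutes)
    return {severity: mean(values) for severity, values in grouped.items()}


for severity, value in mttd_by_severity(DETECTIONS).items():
    print(f"{severity}: {value} minutes")
```

Keep in mind that with few incidents per group, each MTTD value rests on very little data, so a single slow detection can swing it considerably.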
Also, the organization might choose to dismiss outliers from the table, considering those can affect the resulting value. Of course, this generates additional complexity, such as defining the range of values that should be considered outliers.
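One possible way to define that range is the common 1.5 × IQR rule, sketched below. Both the rule and the extra 480-minute incident are assumptions made for illustration; your organization may prefer a different definition of "outlier."

```python
from statistics import mean, quantiles

# Detection times in minutes: the five values from the December example plus
# one hypothetical 480-minute incident added to act as an outlier.
DETECTION_TIMES = [45, 20, 60, 70, 62, 480]


def without_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


filtered = without_outliers(DETECTION_TIMES)
print("MTTD with outliers:   ", round(mean(DETECTION_TIMES), 1))
print("MTTD without outliers:", round(mean(filtered), 1))
```

Whatever rule you pick, it's worth reporting how many incidents were excluded, since a long detection time can be precisely the incident you most need to learn from.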
With the basics out of the way, let's delve deeper into MTTD, starting with comparing MTTD and other important failure metrics.
MTTD vs MTTR vs MTTF vs MTBF
Let's start with MTTR. This initialism can refer to several metrics, since the R can stand for recovery, repair, resolution, or response. For our purposes, let's consider the mean time to recovery metric. Mean time to recovery refers to the average time it takes to solve an incident, counted from the moment it happens. That span includes the time spent detecting the problem, so MTTD is contained in MTTR. Thus, keeping MTTD low also improves MTTR.
MTTF stands for mean time to failure. It refers to the expected working lifespan of a non-repairable item, such as a hardware item. In the context of DevOps, MTTF makes sense when talking about hardware items from your on-premises infrastructure.
Finally, MTBF means mean time between failures. This metric is similar to, and often confused with, MTTF: it refers to the mean time between failures of a given component. However, unlike MTTF, MTBF is used for repairable items.
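To make the relationship between MTTD and mean time to recovery concrete, here's a small sketch. The detection times are the ones from the December example; the detection-to-recovery times are invented for illustration.

```python
from statistics import mean

# (minutes to detect, minutes from detection to recovery).
# The first column comes from the December example; the second is hypothetical.
INCIDENTS = [(45, 30), (20, 90), (60, 15), (70, 45), (62, 20)]

mttd = mean(detect for detect, _ in INCIDENTS)
mean_fix_time = mean(fix for _, fix in INCIDENTS)
# Recovery time is measured from the moment the incident happens,
# so it includes the detection time.
mttr = mean(detect + fix for detect, fix in INCIDENTS)

print(f"MTTD: {mttd} min, mean fix time: {mean_fix_time} min, MTTR: {mttr} min")
```

Under these assumptions, every minute shaved off detection comes straight off the recovery figure as well.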
Best Practices to Keep the MTTD Down
I hope that by now, you're convinced of two things:
MTTD is a crucial metric for tech organizations
You should strive to keep it down
So, how do you go about keeping this metric's value down?
Invest In Constant Training
No amount of money or sophisticated tooling will solve the problem if your people aren't prepared to deal with incidents. So, a top priority should be to invest in training and education regarding incident response, including not only the tools the organization adopts but also, and most importantly, its processes. Engineers, ops people, and whoever is contacted during a crisis should be able to answer questions like:
What is the process, and who should you contact, to get the access and privileges needed to troubleshoot a system?
Who are the key people in the chain of command to whom an issue should be escalated?
When should you escalate an issue, and when not?
Providing training on the proper incident response process assumes such a process exists, which segues nicely into the next point.
Have a Well-Defined, Well-Documented Incident Response Process
It's unreasonable to expect people on call to "wing it" when a crisis occurs. There needs to be a proper incident response process that is well-documented, easily accessible, and always up-to-date.
Adopt Best-of-Breed Monitoring Solutions
Another vital part of a winning incident response strategy is adopting best-of-breed tools for monitoring and alerting. After all, if you want to keep your mean time to detect value as low as possible, you need to be able to detect issues and, just as importantly, alert the responsible stakeholders.
Use Blameless Post-Mortems After Incidents
Every time an incident happens, there's a learning opportunity. Organizations with great incident-response processes aren't those where a culture of finger-pointing reigns. Instead, these are the ones that conduct blameless post-mortems after each incident to answer questions like:
Why did this issue happen?
How could we have prevented it?
How could we have detected it earlier?
The goal here isn't to find people to blame but to identify points in the process that can be improved so that in the future, the organization can prevent the issue or, at least, detect it and fix it sooner.
Identify Trends and Act Preemptively
By tracking MTTD and other metrics over time, you can notice trends that should prompt you to act. For instance, if you notice your MTTD has been increasing over the last quarter, this should trigger an investigation. What is the reason behind this increase? Did detection times go up across the board, or did more incidents occur during that period?
Understanding this is the first step; after that, the organization needs to step in and take measures to reverse that trend.
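As a minimal sketch of such a check, the snippet below summarizes incidents per month and flags a steadily rising MTTD. The October and November records are hypothetical; only the December values come from the earlier example.

```python
from collections import defaultdict
from statistics import mean

# (month, minutes to detect). October and November are hypothetical data.
RECORDS = [
    ("2022-10", 30), ("2022-10", 40),
    ("2022-11", 45), ("2022-11", 50), ("2022-11", 55),
    ("2022-12", 45), ("2022-12", 20), ("2022-12", 60),
    ("2022-12", 70), ("2022-12", 62),
]


def monthly_summary(records):
    """Map each month to (incident count, MTTD in minutes), in date order."""
    by_month = defaultdict(list)
    for month, minutes in records:
        by_month[month].append(minutes)
    return {m: (len(v), mean(v)) for m, v in sorted(by_month.items())}


def is_rising(values):
    """True when each value is strictly greater than the previous one."""
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))


summary = monthly_summary(RECORDS)
for month, (count, mttd) in summary.items():
    print(f"{month}: {count} incidents, MTTD {mttd} min")

if is_rising([mttd for _, mttd in summary.values()]):
    print("MTTD rose every month this quarter; time to investigate.")
```

Printing the incident count next to the MTTD helps separate the two questions: whether detection is getting slower, or whether there are simply more incidents to detect.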
In an ideal world, incidents wouldn't happen, and no one would need to wake up at 2 AM to look at a computer screen and figure out what is causing microservice X to stop working. We live in the real world, though, and incidents happen much more often than we'd like.
After learning about MTTD—and other KPIs as well—a logical follow-up would be to consider tools you can leverage.
Plutora is a value stream management platform. It helps organizations create a single source of truth; that way, they can streamline their software delivery process. It also shines regarding data visualization. You can leverage Plutora's Key DevOps Metrics to obtain insights that will help you improve your organization's incident response strategy. To learn more about Plutora and the most important software metrics, watch the "Beyond DevOps and Flow: The Who, What, and When for Software Delivery Metrics" webinar.