Observability vs. Monitoring: A Breakdown for Managers
Jun 4, 2021
To manage computer systems (or almost any system you want to control, really), you need to know what's going on inside them. Monitoring and observability are two ways toward that goal.
Monitoring is the more traditional approach, where you retrieve metrics that you've decided you need. Often this is data that has been useful in the past: How much disk space is the database using? How many requests per second is the web server handling? Monitoring focuses on a set of known failure modes. Running out of disk space is a very common failure mode, and monitoring lets you stay on top of it.
However, computer systems can go wrong in new and exciting ways that your monitoring never foresaw. Monitoring might tell you something is wrong (requests are failing, for example). But in order to understand the reasons, you need an integrated view. This is what observability gives you: a holistic view of your system, integrating data from several sources such as logs, metrics, or traces, and the ability to drill down and explore.
Or to put it another way, monitoring gives you data, but observability turns that data into information, which helps you make good decisions. This post will explain how monitoring and observability benefit you and walk you through two examples to show how observability supports value stream management in particular.
Next, let's look at how observability helps manage the software delivery process.
Why Observe the Value Stream?
It often makes sense to view software development in terms of value streams rather than as a set of projects. The software's value to the customer is at the center of all efforts. Value stream management looks at more than software development and delivery, but let's focus on those for now.
Managing software development is famously difficult. Efforts to apply monitoring tools abound, from Gantt charts to burndown charts and other agile metrics. These are useful tools. But observability can do more for you because it helps you go from seeing that a project isn't going well to understanding what you can do about it.
For example, your metrics may tell you that the delivery date for a feature is in danger of slipping: you can see the burndown graph slowing down or the number of open issues growing. But why is this happening? This is where you need to dig into the data. Of course, you'll talk to the people working on the feature, and many of them will have good insights. But they can also have a siloed view, and you'll want to make sure you use all available information.
Observability for Value Stream Management
Value stream management applies metrics that cover the whole of the software delivery process and your entire organization. Lead times, bug counts, release frequency and other DevOps metrics, customer satisfaction: a good value stream management platform will allow you to view this data together so you can drill down and decide where to take action.
For software delivery, you'll often use value stream flow metrics. These describe the progress that work items make through your software delivery pipeline. Common metrics are as follows:
Cycle time: elapsed time from when work starts on an item to when the item is completed
Lead time: elapsed time from request to release, measured from the customer's point of view
Throughput: velocity, or the number of work items completed in a set time period
Efficiency: the ratio of active work time to waiting time
Work in progress: the number of work items currently being worked on
Flow efficiency: the time work items are actively worked on as a proportion of their total cycle time
Work profile: the proportion of each type of work item delivered by the software value stream in a time period
Many systems group work items by category: feature, technical debt, risk (security, compliance, and data protection work), or defect (bugs).
Lead time vs. cycle time is a good place to start when you want to know the overall state of your project. Both measure how long it takes to complete a work item, but the clock starts at different points: with lead time, it starts at the customer request; with cycle time, it starts when a developer begins work on the feature. As with all metrics, you need to drill down into the data in order to get true insights.
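To make the distinction concrete, here's a minimal sketch in Python. The records and field names (requested_at, started_at, completed_at) are hypothetical stand-ins for whatever your issue tracker actually exports.

```python
from datetime import datetime
from statistics import median

# Hypothetical work item records; field names are placeholders for
# whatever your issue tracker exports.
work_items = [
    {"id": "FEAT-101", "requested_at": "2021-04-01", "started_at": "2021-04-12", "completed_at": "2021-04-20"},
    {"id": "FEAT-102", "requested_at": "2021-04-03", "started_at": "2021-04-05", "completed_at": "2021-04-26"},
    {"id": "BUG-17",   "requested_at": "2021-04-10", "started_at": "2021-04-19", "completed_at": "2021-04-21"},
]

def days_between(start: str, end: str) -> int:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).days

# Lead time: the clock starts at the customer request.
lead_times = [days_between(i["requested_at"], i["completed_at"]) for i in work_items]
# Cycle time: the clock starts when a developer begins work.
cycle_times = [days_between(i["started_at"], i["completed_at"]) for i in work_items]

print(f"median lead time:  {median(lead_times)} days")
print(f"median cycle time: {median(cycle_times)} days")
```

A large gap between the two usually means items sit waiting for a long time before anyone picks them up, which points at prioritization or capacity rather than development speed.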
Metrics Have Consequences
Metrics need interpretation. Raw numbers mean little on their own: when is a bug count high, for example? The only way to tell whether a metric value is good or bad is by linking it to your value stream. A high bug count means you're not delighting your customers as much as you could be. Moreover, a high number of bugs represents work your developers will need to do in the future, and that's time they won't spend on new features.
If you use metrics as a basis for reward or criticism, your teams will adapt their behavior. But they may not adapt it the way you want. For example, if you reward teams for low open bug counts, they may decide to reject bug reports quickly ("can't reproduce," "works for me," "you're holding it wrong"). That will not make your software better. To be clear, closing some bug reports is fine (you do need to prioritize), but teams shouldn't be doing this just because it makes a metric look good.
Let's look at two example metrics: throughput and bug counts. I'll focus on how to drill down in order to make the data meaningful.
Example: Throughput
Work item throughput tells you how quickly your team is closing items. As with most metrics, it's meaningless in isolation. For a start, work items come in different sizes, and the time it takes to complete one varies. Below is a sample chart showing the number of work items closed per week, broken down by type (feature, tech debt, risk, or defect).
What does this graph tell us?
Feature work dominates the work items. That can be bad if the team is neglecting work on the other categories. To figure this out, also look at the total open work items, or at lead and cycle time.
There are few items from the risk category. This can be OK; typical items in this category are security or privacy reviews, which tend to have a long lead time and involve multiple teams, so they may take a while to close.
Why is there a low number of items in week 6? You see a high number of items in weeks 5 and 7, so this could be just an effect of showing counts by week: people wrapped up a lot of work in week 5, then started on a bunch of new items in week 6.
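If you want to produce a breakdown like this yourself, the aggregation is simple. Below is a minimal sketch that assumes each closed work item carries a completion date and a type label; the sample data is invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical closed work items: (completion date, type label).
closed_items = [
    (date(2021, 5, 3), "feature"),
    (date(2021, 5, 4), "defect"),
    (date(2021, 5, 11), "feature"),
    (date(2021, 5, 12), "tech debt"),
    (date(2021, 5, 14), "risk"),
]

# Count closed items per ISO week, broken down by type.
throughput = Counter(
    (completed.isocalendar()[1], item_type) for completed, item_type in closed_items
)

for (week, item_type), count in sorted(throughput.items()):
    print(f"week {week}: {count} x {item_type}")
```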
When you dig into graphs like this, it's often useful to look at the average (or median) values first to get an idea of what's normal. But then look at the extremes of the distribution as well: say, the 10% of work items that have been open the longest.
You may find people arguing about requirements or items that turned out to be much harder to implement than anybody thought, for example. Those are both good places for intervention.
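As a rough sketch of that drill-down, assuming you've already computed a cycle time in days for each completed item, you could look at the median and then pull out the slowest tail:

```python
from statistics import median

# Hypothetical cycle times (in days) for completed work items.
cycle_times = [2, 3, 3, 4, 5, 5, 6, 8, 9, 14, 21, 35]

print(f"median cycle time: {median(cycle_times)} days")

# The slowest ~10% of items: these are the ones worth reading in detail.
slowest = sorted(cycle_times, reverse=True)[: max(1, len(cycle_times) // 10)]
print(f"slowest items: {slowest}")
```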
Example: Do We Find Bugs Too Late?
Let's look at another common metric: the number of open bugs. It's quite common to equate a "high number of bugs" with "bad software." As I mentioned above, that can be misleading. With this in mind, let's look at the metric anyway. First, the total number of new bugs, or defects, per week.
Whether the counts are high depends on what's normal for your teams. Let's drill down though! Bugs can be reported by automated systems such as integration tests, by employee testing, or by customers. Most bug tracking systems will let you tag bugs with the environment (such as dev, QA, or production) where they were found, so let's do that here.
This chart shows the distribution of defects by environment, and it tells an interesting story: most defects are found in production. Shouldn't developers and testers have found these in the development or QA environments? Speaking of which, what's up with the low number of bugs from the development environment?
Maybe your developers don't always report bugs in the development environment, and instead just go ahead and fix them. You can check this by looking at the defects in detail: if most of the defects in development have high severity or were created through automation like your CI/CD system, it's likely that developers are not even reporting the lower-severity items. You could also ask your developers.
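One way to check that hypothesis, assuming your tracker exports an environment, a severity, and a reporter for each defect (the field names below are made up), is to break the development-environment defects down by reporter and severity:

```python
from collections import Counter

# Hypothetical defect records exported from a bug tracker.
defects = [
    {"env": "production", "severity": "low",    "reporter": "customer"},
    {"env": "production", "severity": "high",   "reporter": "customer"},
    {"env": "qa",         "severity": "medium", "reporter": "tester"},
    {"env": "dev",        "severity": "high",   "reporter": "ci"},
    {"env": "dev",        "severity": "high",   "reporter": "ci"},
]

# Where are defects being found?
print(Counter(d["env"] for d in defects))

# Of the dev-environment defects, who reported them and how severe are they?
dev_defects = [d for d in defects if d["env"] == "dev"]
print(Counter((d["reporter"], d["severity"]) for d in dev_defects))
```

If the development bucket is dominated by high-severity, automation-reported defects, that supports the idea that developers are fixing the smaller issues without filing them.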
When a lot of bugs are found late in the process, it's often time to invest in better integration tests. Now you have better data to make that decision.
How Can You Try This?
Monitoring is good; it helps you see what's going on. But observability adds structure, meaning, and depth to the metrics. Monitoring drops breadcrumbs, and observability assembles the crumbs into a trail.
I've only scratched the surface of what you can do with observability, though.
Which tests fail the most, and does this correlate with development activity? Do you have enough test coverage in the areas that get the most bug reports in production? How much time are your developers spending on tasks that could be automated? Getting answers requires data from several toolchains such as your issue tracker, CI/CD system, and code repository. The common tools export this information, but each in its own way, with its own data model, so pulling this data together can be a challenge.
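To give a feel for the "each in its own way" problem, here's a minimal sketch that maps records from two hypothetical tools into one common work item shape. Both payloads and all field names are invented for illustration; a real integration would also need to reconcile statuses, timestamps, and identities across tools.

```python
# Two tools describing the same kind of thing with different field names
# (both payloads are invented for illustration).
tool_a_issue = {"key": "FEAT-101", "resolutiondate": "2021-05-20", "issuetype": "Story"}
tool_b_ticket = {"id": 4711, "closed_at": "2021-05-21", "kind": "bug"}

# One common shape to report against.
def from_tool_a(issue):
    return {"id": issue["key"], "completed_at": issue["resolutiondate"], "type": issue["issuetype"].lower()}

def from_tool_b(ticket):
    return {"id": str(ticket["id"]), "completed_at": ticket["closed_at"], "type": ticket["kind"]}

work_items = [from_tool_a(tool_a_issue), from_tool_b(tool_b_ticket)]
print(work_items)
```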
You can definitely get a good start by pulling data from your tools and sticking it into a spreadsheet, for example. However, you'll quickly find yourself spending a lot of time wrangling the data, and getting your reports to be repeatable will be a struggle. Value stream management tools help: you'll want one that can pull data from your toolchains and give you an integrated view with extensive drill-down capabilities. Plutora's tool does this; you can find out more with a free demo.