Plutora Blog - Release Management
How to Measure Release-related Downtime
Software releases are the single biggest factor contributing to downtime across all industries. When you hear of a high-profile outage at a bank or an airline, it is almost always related to a software release or a high-risk change being made to a system.
While organizations can put quality checks and governance gates in place to prevent downtime there’s always a tradeoff between agility and risk. The most reliable software system would be one that experiences no change at all, but with businesses under pressure to keep up with the competition and deliver more features to customers faster the idea of keeping a system static is unrealistic.
If you deliver software to production you understand that releases are risky, and in this post we’re going to provide you with some steps you can take to account for release-related risk. It may be impossible to reduce your release-related risk exposure to zero, but if you stand up a rigorous approach to tracking release-related downtime it will make it easier to justify greater investment in release management.
1. Track Key Metrics
Businesses keep track of expected revenue and activity. If you’ve been in business for longer than a few months you have some expectation for how much customer activity you are going to experience on any given day. If you are a bank you should be tracking the number of customers logging into account management services, performing transfers, and making deposits. If you work in e-commerce you will be tracking orders per minute and other key metrics such as site traffic.
What motivates your business? Are you more focused on traffic and social activity, or do you run a business that tracks dollars? These are the questions you’ll have to answer as you instrument your systems to produce real-time data about key metrics. There are some obvious numbers that every website should track, such as page views per second and error rates for critical systems.
2. Build a Model of Expected Traffic and Revenue
These metrics will vary over time so you need to build in a model for how traffic changes over a day, week, and year. Here are some of the variations you’ll need to build into your models:
2a. Account for Seasonal Variation
Most businesses follow a seasonal pattern of activity. E-commerce sites get ready for the holiday season, accountants prepare for tax season, and other businesses react to the annual pattern that is present in their industry. An e-commerce site will be busier in November than it will be in April, and you should look at the last few years to establish an accurate year-over-year (YoY) baseline for daily activity.
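One way to turn "look at the last few years" into a number is to average the same calendar day across prior years. The sketch below is only an illustration: the `history` dictionary, the function name, and the sample figures are all hypothetical, and a real model would probably also align on day of the week.

```python
from datetime import date
from statistics import mean


def seasonal_baseline(history, target, years_back=3):
    """Average a daily metric on the same calendar day over prior years.

    history: dict mapping date -> observed daily value (assumed shape).
    target:  the date you want a baseline for.
    """
    samples = []
    for offset in range(1, years_back + 1):
        try:
            prior = target.replace(year=target.year - offset)
        except ValueError:
            # Feb 29 has no counterpart in a non-leap year; skip that year.
            continue
        if prior in history:
            samples.append(history[prior])
    if not samples:
        raise ValueError("no historical data for this calendar day")
    return mean(samples)


# Hypothetical daily order counts for the same November day in prior years.
history = {
    date(2021, 11, 26): 1000,
    date(2022, 11, 26): 1200,
    date(2023, 11, 26): 1400,
}
baseline = seasonal_baseline(history, date(2024, 11, 26))
```

With three years of data the baseline is simply the average of those three days. Widening `years_back` smooths out one-off events at the cost of reacting more slowly to recent change.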
2b. Account for Daily Variation
An e-commerce site offering a good deal may see more or less traffic depending on the promotion and how visible it is to customers. When you are establishing your baseline traffic patterns you need to account for the general day-to-day variation you expect to encounter. This means that your expected traffic or revenue on any given day isn’t a single number; it is a range of values based on your experience.
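To make "a range of values" concrete, here is a minimal sketch that derives a range from recent comparable days using the mean plus or minus a multiple of the standard deviation. The choice of two standard deviations, the function names, and the sample values are assumptions for illustration; percentiles would work just as well.

```python
from statistics import mean, stdev


def expected_range(samples, k=2.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations.

    samples: metric values from comparable past days (needs at least two).
    k:       how wide the band should be; 2.0 is an arbitrary starting point.
    """
    center = mean(samples)
    spread = stdev(samples)
    # Traffic and revenue cannot be negative, so clamp the lower bound at 0.
    return (max(0.0, center - k * spread), center + k * spread)


def within_range(value, bounds):
    """True when an observed value falls inside the expected band."""
    low, high = bounds
    return low <= value <= high


# Hypothetical page views (in thousands) for the last five Mondays.
recent_mondays = [980, 1010, 1005, 995, 1010]
bounds = expected_range(recent_mondays)
is_normal = within_range(1000, bounds)
is_unusual = not within_range(900, bounds)
```

A value outside the band is not proof of a problem, only a signal that the day differs from what your experience predicts.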
2c. Account for New Business and Growth
As your business expands and offers new products and services and as your customer base grows you are going to have some percentage increase in your business year over year. You are also going to experience differences from one quarter to another. Newer systems may experience an increase in traffic as you transition from legacy systems to new systems, and you should account for these differences in your model. You should also account for marketing campaigns and other activity that may affect your volume of business.
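Growth and campaigns can be layered on top of a seasonal baseline as simple multipliers. The following is a rough sketch under that assumption; the function names, the multiplicative form, and all of the figures are hypothetical.

```python
def yoy_growth_rate(this_year_total, last_year_total):
    """Fractional year-over-year growth, e.g. 0.10 for ten percent."""
    if last_year_total <= 0:
        raise ValueError("last year's total must be positive")
    return (this_year_total - last_year_total) / last_year_total


def adjust_for_growth(seasonal_baseline, growth_rate, campaign_uplift=0.0):
    """Scale a seasonal baseline by growth and an optional campaign uplift.

    Both adjustments are treated as independent multipliers, which is a
    simplifying assumption rather than a rule.
    """
    return seasonal_baseline * (1 + growth_rate) * (1 + campaign_uplift)


# Hypothetical annual order totals and a 5% uplift from a marketing campaign.
growth = yoy_growth_rate(1_320_000, 1_200_000)
expected_orders = adjust_for_growth(1200, growth, campaign_uplift=0.05)
```

Keeping each adjustment as a named parameter also documents the assumptions that went into the estimate.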
3. Establish an Estimated Baseline & Evaluate It Against Reality
Once you understand how your key metrics vary over different timeframes you can create a model to predict future traffic. Document the assumptions that go into your model. Your model should use historical data over different timeframes to create an estimated baseline for different key metrics.
Instead of relying on intuition and educated “guesswork”, you should create a system that generates a number. How many page views is your website expecting tomorrow? What is your anticipated revenue? How many errors do you expect between noon and 2 PM? This is the sort of data that you will use when you have to calculate the impact of release-related downtime.
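Evaluating a baseline against reality means comparing what the model predicted with what actually happened. One common yardstick is the mean absolute percentage error; the sketch below assumes you have matching lists of predicted and observed values, and the sample numbers are made up.

```python
def mean_absolute_percentage_error(predicted, actual):
    """Average of |actual - predicted| / actual, skipping zero actuals."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    errors = [abs(a - p) / a for p, a in zip(predicted, actual) if a != 0]
    if not errors:
        raise ValueError("no non-zero actual values to compare against")
    return sum(errors) / len(errors)


# Hypothetical daily page views (in thousands): model output vs. observed.
predicted = [100, 200, 300]
actual = [110, 190, 300]
mape = mean_absolute_percentage_error(predicted, actual)
```

If the error stays small on ordinary days, you can be more confident that a large deviation during a release reflects the release and not a weak model.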
4. Account for Release-related Downtime
Once you’ve selected your key metrics to monitor and you’ve established both a baseline and a predictive model, you are ready to start keeping track of release-related downtime.
4a. First, Mind the Gaps
The most mature release engineering teams understand how to perform a release to production with zero downtime. This is done by using multiple clusters of systems and swapping systems in and out of rotation in a complex sequence of events. While even the most complex of systems can be adapted to support downtime-free release processes, it requires experience and a rigorous planning process.
Many businesses tend to accept a very low level of downtime during a release. If a database is going to be unavailable for 10 seconds then it might not be worth the time and effort it would take to create a multi-cluster deployment strategy. Also, not every business serves customers with a 24/7 availability expectation. If you run a B2B site and your customers are clustered in a specific region of the world you can schedule releases for off-hours. Sometimes downtime is acceptable.
For most companies this isn’t the case. As businesses continue to expand to support a worldwide customer base, the idea of scheduled downtime, or of interrupting a 24-hour business cycle for even a few seconds, is unacceptable. No one wants to see a gap in key metrics that is related to the release process, but when you perform a release and suffer release-related downtime this is just what you’ll end up seeing: a gap in your key metrics. Where you expected a key metric to remain constant you’ll see it drop. This is why you need to make sure that your operations teams are always tracking key metrics during a release event. A drop is often one of the first signs that something isn’t working as expected.
4b. Capture Availability Problems
This is the easy part. When your release processes cause system unavailability you’ll be able to see it as a gap in a key metric. Maybe your release happened at 12AM UTC and you have a graph showing expected traffic to a specific subsystem, or a set of related subsystems, trending to zero for 25 minutes. Maybe the release caused a momentary glitch in a database connection that caused a spike in latency for a specific system. These are the impacts you need to capture as issues to be tracked after a release has been completed.
Shortly after a release you can provide a report to the business on how many minutes or seconds of downtime a release caused and whether this downtime affected customers.
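Spotting a gap and reporting its length can be automated once you have per-minute samples and the low end of your expected range. The sketch below flags any run of samples that falls below half of that lower bound; the 50% threshold, the function names, and the data are assumptions chosen for illustration.

```python
def find_gaps(observed, expected_low, threshold=0.5):
    """Return (start, end) index pairs where the metric collapsed.

    A sample counts as degraded when it is below threshold * expected_low.
    The end index is exclusive, so (2, 5) covers samples 2, 3 and 4.
    """
    if len(observed) != len(expected_low):
        raise ValueError("observed and expected_low must be the same length")
    gaps = []
    start = None
    for i, (value, low) in enumerate(zip(observed, expected_low)):
        degraded = value < threshold * low
        if degraded and start is None:
            start = i  # a gap begins here
        elif not degraded and start is not None:
            gaps.append((start, i))  # the gap ended on the previous sample
            start = None
    if start is not None:
        gaps.append((start, len(observed)))  # gap still open at the end
    return gaps


def downtime_minutes(gaps, minutes_per_sample=1):
    """Total length of all gaps, given the sampling interval."""
    return sum((end - start) * minutes_per_sample for start, end in gaps)


# Hypothetical requests per minute around a release window.
observed = [100, 98, 0, 0, 0, 97, 101]
expected_low = [90] * len(observed)
gaps = find_gaps(observed, expected_low)
total_downtime = downtime_minutes(gaps)
```

Here the metric drops to zero for three consecutive samples, so the report would show one gap lasting three minutes.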
4c. Capture Revenue Loss
Once you’ve identified gaps in system availability you can start relating these gaps to potential revenue loss or, if revenue isn’t something you measure, to lost trade volume or a similar measure. Note that I used the term “potential”: revenue isn’t lost when a system fails, it is missing. You are measuring lost opportunity for revenue, assuming that demand would have matched the baseline you’ve created.
Some of this work will be educated guesswork, especially if the failure condition was partial.
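A simple way to estimate potential revenue loss is to sum the shortfall between baseline and actual revenue over the gap. The sketch below assumes per-minute revenue series and a gap expressed as a pair of sample indices; all names and figures are hypothetical. Summing the shortfall, not the full baseline, handles partial failures where some revenue still came through.

```python
def potential_revenue_loss(expected, actual, gap):
    """Sum the revenue shortfall (expected - actual) across a gap.

    expected, actual: per-sample revenue series of equal length.
    gap: (start, end) sample indices, end exclusive.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must be the same length")
    start, end = gap
    # Samples that beat the baseline contribute zero, not a negative loss.
    return sum(
        max(0.0, e - a)
        for e, a in zip(expected[start:end], actual[start:end])
    )


# Hypothetical revenue per minute in dollars; minute 3 is a partial failure.
expected = [500.0] * 6
actual = [510.0, 0.0, 0.0, 120.0, 495.0, 505.0]
loss = potential_revenue_loss(expected, actual, gap=(1, 4))
```

The result is only as good as the baseline behind it, which is why the estimate should be reported as potential loss and not as a booked figure.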
5. Act, Analyze, Adapt: Incorporate Failure Analysis into Release Planning
Don’t just release software and react to release-related downtime. Go into your release processes with a plan. Take time to understand what your key metrics are for system performance and establish a baseline for revenue in a given day or week. When you release software stay vigilant about your key metrics and capture any changes that may be related to your releases.
Once you get into the habit of tracking metrics during releases and relating problems back to release activities, you can use this data to move your teams toward releases that don’t cause downtime. The goal is to get to a state where releases can happen more frequently without introducing unnecessary risk. If you can establish this “Act, Analyze, Adapt” feedback loop using a tool like Plutora, you can take this data and use it to plan future releases. You can identify the teams that cause the most problems and redirect your resources to ensuring that these projects take the time to think about their release processes.