Plutora Blog - Deployment Management, Release Management, Software Development
Disaster Recovery Plan: A Complete Guide for the Savvy Leader
What happens to your business when disaster strikes? Do you have a disaster recovery plan? Will you be able to keep the lights on, or will you leave your customers, and maybe even your employees, in the lurch?
A disaster recovery (DR) plan is what separates organizations that are successful in the face of crisis from the others. It’s the set of processes and procedures you prepare before a crisis strikes so you can continue your business operations after the disaster arrives. Savvy leaders ensure that their plan is in place ahead of time so that in the event of a disaster their IT operations survive and the business recovers quickly.
Let’s talk about what a disaster recovery plan is, why you need one, and how to create a plan and put it into place.
Why Is a Disaster Recovery Plan Important?
You can’t always avoid disasters. Having a plan ready is essential for minimizing damage and getting things back up and running right away. As a matter of fact, by putting preventative measures in place, some companies are able to prevent disasters from happening in the first place.
But the benefits don’t stop there. Your DR plan is an insurance policy, and like standard insurance policies, it can limit your exposure to legal liability caused by data loss or unnecessary outages. When a crisis strikes, you need to be in a position to act decisively. You need to have a plan and a team that knows how to execute it.
Simply put, disaster recovery plans are essential tools for mitigating damage and preventing disasters from halting or hurting your business.
Business Continuity Plans vs. Disaster Recovery Plans
We talked about business continuity plans in an earlier post. While a business continuity plan covers your entire enterprise and your employees, a disaster recovery plan focuses more narrowly on your ability to keep your IT infrastructure in service during a crisis.
That said, your IT infrastructure is critical for delivering your value stream to your clients and supporting your employees. If you don’t have a way for your employees to deliver services to your customers, you don’t have a business. You could make a case for the DR plan being the most important part of your overall business continuity strategy.
So let’s talk about how to put a disaster recovery plan together.
What Is a Disaster?
The first step in planning for a disaster is identifying what counts as one for your business. What risks does your company’s IT infrastructure face? In early 2020, the entire world faced a new type of disaster: A global pandemic forced companies to close their stores and offices. If a business had the ability to support its workforce from home, it was able to stay open. If it didn’t, it stayed closed for months and faced the possibility of going out of business forever.
All businesses face natural disasters like extreme weather that can cause flooding, fire, structural damage, and power outages. How these events affect your company depends on your infrastructure.
Here are some questions to consider when identifying risks:
- What happens if you lose an office or a data center? Could you lose one (or both) to a ransomware attack?
- Do you rely on cloud infrastructure? Are your cloud systems located in a single availability zone, or spread over more than one?
- Do you have a client-facing system that could be a target for a denial-of-service attack?
You’re not writing the plan now, just identifying risks. Don’t fall into the trap of trying to solve problems as you think of them.
While you’re identifying possible problems, it’s worthwhile to highlight what isn’t, or at least shouldn’t be, a crisis. A disk failure isn’t a disaster. Neither is a circuit outage. They’re component and infrastructure failures. You should have plans and preventative measures in place for them too.
What Are Your Objectives?
Not All Disasters Are Created Equal
When we talk about disaster recovery, we usually talk about fire, flooding, earthquakes, and other, well, disasters. We talk about losing data centers either through extended power outages or outright destruction.
The COVID-19 pandemic served as a wake-up call though. Rather than losing our IT infrastructure, we lost access to our workplaces. The loss was to only part of the company’s infrastructure, but it lasted for months. Companies that had contingencies in place for this kind of limited outage, whether by accident or design, were able to remain at least partially in business. Organizations that didn’t have these contingencies floundered.
In-person meetings and retail storefronts had to stop in the face of COVID-19, but online operations stayed open, and many thrived. Companies were forced to adapt or, in some cases, abandon parts of their business in favor of others.
Here are some questions to ask about your objectives:
- What are your objectives when it comes to recovering from a disaster?
- Do your objectives differ based on the type of disaster you’re planning for? (Hint: They need to.)
- What are the differences between losing an office for a short period of time and being forced to abandon it indefinitely?
Talk to Your Stakeholders
Data that you think you may be able to forgo for a short period of time might be critical for your legal or compliance departments. Talk to the departments you’re responsible for supporting and find out what their specific needs are. Go over the list of potential disasters you put together in the first step with these departments and ask these questions:
- Is your business subject to regulations? How do they affect your data retention requirements?
- Do you need to keep all or some of your data on hand at all times?
- How do your priorities map into different parts of your business?
- Do different departments have different tolerances for downtime? Can you leave parts of the business offline for short periods of time, while others must always be available?
- How long can you afford to lose access to your data and production systems?
- How much business would you lose if your website were down for an hour, a day, or a month?
- What would be the expense of having your employees lose access to critical applications?
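The downtime questions above come down to simple arithmetic once stakeholders give you numbers. Here is a minimal sketch of that calculation. The `Department` fields, the dollar figures, and the 30-day month are all illustrative assumptions, not figures from any real business.

```python
from dataclasses import dataclass

# Outage lengths in hours. Treating a month as 30 days is an assumption.
HOURS = {"hour": 1, "day": 24, "month": 24 * 30}


@dataclass
class Department:
    name: str
    revenue_per_hour: float     # hypothetical: revenue lost per hour offline
    staff_cost_per_hour: float  # hypothetical: cost of idle employees per hour


def downtime_cost(dept: Department, hours: float) -> float:
    """Estimate the cost of an outage lasting `hours` for one department."""
    return (dept.revenue_per_hour + dept.staff_cost_per_hour) * hours


def cost_table(depts):
    """Return {department: {outage_length: estimated_cost}}."""
    return {
        d.name: {label: downtime_cost(d, h) for label, h in HOURS.items()}
        for d in depts
    }


if __name__ == "__main__":
    departments = [
        Department("web store", revenue_per_hour=2000.0, staff_cost_per_hour=150.0),
        Department("accounting", revenue_per_hour=0.0, staff_cost_per_hour=300.0),
    ]
    for name, row in cost_table(departments).items():
        print(name, row)
```

A table like this is only as good as the inputs, so treat it as a conversation starter with each department rather than a final answer.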
Form Your Disaster Recovery Team
Disaster has struck. Who’s in charge? What do they need to do? Who should they call? You don’t want to ask these questions in real time. Form your disaster recovery team now, while you’re still writing your plans.
This team should consist of both IT staff and stakeholders. The people executing your plan should be the people who help put the plan together. This way, they’ll be familiar with what they need to do in advance and will also feel a sense of ownership over the process.
At the same time, including stakeholders in the process gives them an opportunity to critique the plan ahead of time. They’ll also know what to expect and when to expect it. IT managers often feel the need to hide or obscure the nitty-gritty details of disaster recovery. This is almost always a mistake. Stakeholders are your allies if you keep them in the loop.
Once your team is formed, make sure they’re involved with each and every step of the DR planning process.
Can You Identify Your Critical Resources?
After you’ve identified the risks, outlined your goals for dealing with them, and assembled a team, it’s time to identify your critical resources. Which of your systems does each risk threaten? This is an important step in formulating a plan because you need to know what’s going to happen before you can plan how you’ll react to it.
Start with an inventory of your IT assets and review the following:
- Where is each asset located, and how does it interact with the other systems?
- Will a crisis in one office or data center lead to a cascade of issues in another? Hopefully, you have someone on your DR team who has this information prepared in advance.
- Do you rely on SaaS systems?
- What would be the impact of losing your cloud accounting system to a flood?
- What happens to your value stream if your cloud provider loses power in one of their centers?
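If your inventory records which assets depend on which, you can answer the cascade question mechanically. The sketch below assumes a simple dictionary-based inventory. The asset names and the `impacted_by` helper are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical inventory: each asset maps to the assets it depends on.
DEPENDS_ON = {
    "web-store": ["orders-db", "payment-gateway"],
    "orders-db": ["datacenter-east"],
    "reporting": ["orders-db"],
    "payment-gateway": [],
    "datacenter-east": [],
}


def impacted_by(failed, depends_on):
    """Return every asset that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a failed asset to its dependents.
    dependents = {}
    for asset, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(asset)

    # Breadth-first walk outward from the failed asset.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for asset in dependents.get(current, []):
            if asset not in impacted:
                impacted.add(asset)
                queue.append(asset)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("datacenter-east", DEPENDS_ON)))
```

In this made-up inventory, losing `datacenter-east` takes down the orders database and everything that relies on it, which is exactly the kind of cascade you want to discover before a crisis.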
Do You Have Contingencies in Place?
You’ve identified the risks and mapped them to the systems they threaten. Are there steps you can take to head off some of these risks before they happen?
These preventative measures are the most important part of a disaster recovery plan. In some cases, you can avoid a disaster, or at least a major outage, completely. Even when you can’t, you’re laying the groundwork for being able to recover from a disaster quickly with little or no data loss.
The most fundamental, and often overlooked, preventative measure is making regular backups and storing them off-site. Backups protect against equipment failures and can serve as part of a disaster recovery plan.
Backups and off-site storage used to require expensive equipment like tape drives and contracts with an off-site storage company. Now, it’s a lot simpler. Cloud services like Amazon Glacier and Google Cloud Storage make it easy to transfer backups off-site.
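A backup is only useful if you can trust it when you restore. As a rough sketch, the snippet below packs a directory into a compressed archive and records a SHA-256 checksum alongside it, so the copy can be verified after it has been moved off-site. It uses only the Python standard library. The function names and the `.sha256` sidecar file are assumptions made for illustration, and the upload step is deliberately left out because it depends on your storage provider.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_backup(source_dir: Path, archive_path: Path) -> str:
    """Pack source_dir into a gzip-compressed tar archive and return its checksum."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=source_dir.name)
    checksum = sha256_of(archive_path)
    # Store the checksum next to the archive so it travels off-site with it.
    Path(str(archive_path) + ".sha256").write_text(checksum + "\n")
    return checksum


def verify_backup(archive_path: Path) -> bool:
    """Return True if the archive still matches its recorded checksum."""
    recorded = Path(str(archive_path) + ".sha256").read_text().strip()
    return sha256_of(archive_path) == recorded
```

Running `verify_backup` on the off-site copy as part of a scheduled job is one way to catch a corrupted backup long before you need it.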
Once you’ve covered backing up your critical data, you can look at more advanced—and potentially more expensive—options such as redundant systems. Most server-class hardware is shipped with redundant power supplies and RAID drive systems. These features are necessary to protect your infrastructure from hardware failures. If your infrastructure lives in the cloud, you don’t have to worry about buying server-class hardware. You’re renting it already.
But what’s your backup plan for when you lose a data center? You need to plan for that regardless of where your primary system runs.
Depending on your infrastructure, you may already be running in two colocation facilities, two cloud regions, or two different cloud providers. What would happen if one of them went offline? Does your infrastructure “heal” itself, or do you need to intervene?
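To make the "heal itself or intervene" question concrete, here is a minimal sketch of the decision an automated failover has to make: walk your sites in priority order and use the first healthy one. The `pick_active_site` function and the site names are hypothetical, and the health check is passed in as a plain function because how you probe a site depends entirely on your infrastructure.

```python
from typing import Callable, Optional, Sequence


def pick_active_site(
    sites: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


if __name__ == "__main__":
    # Hypothetical health status, as a monitoring system might report it.
    status = {"colo-primary": False, "cloud-secondary": True}
    active = pick_active_site(["colo-primary", "cloud-secondary"], lambda s: status[s])
    if active is None:
        print("No healthy site: manual intervention required")
    else:
        print("Serving from", active)
```

If nothing in your environment makes this decision automatically, then a person has to, and your plan should name that person.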
Or, say you’re running in only one facility because you want to save money or need to serve only a specific geographic area. How quickly could you bring a new system online? This is one of the areas where Plutora can help. You need a deployment plan that you can execute quickly and efficiently on a “fresh” system.
Identify Your Tools
Outlining your contingencies naturally leads you to the next step, which is identifying your tools. Here are some questions to review:
- What tools do you need to restore your IT infrastructure after a disaster?
- How will your team communicate during a disaster?
- Will the tools you use day-to-day work in a crisis?
- Are your normal communication channels usually filled with a lot of chatter? Do you need to set aside a dedicated chat room or Slack channel for emergency communication?
Back in the Objectives section, you documented how quickly you need to recover your critical systems. Do you have the right tools in place for that? If not, it’s time to look at an upgrade.
Restoring systems from backup requires access to the files and the proper software to load them. Rebuilding servers requires the operating system software. Proprietary software may require license files or keys.
It may be time to introduce new tools and processes. Are you using continuous integration/continuous deployment (CI/CD)? If you’re not, now might be a good time to take a look. With a resilient CI/CD system, software deployments are routine, even to a fresh system in a new cloud provider or colo facility.
Write Your Plan
Now it’s time to connect the dots. Put together a plan or a series of plans that cover what you’ll do in the event of a disaster.
But creating a team and discussing what you’ll do in the event of a disaster isn’t enough—your plan needs to be documented, distributed, and tested. If only a few people understand the big picture, your plan will fail.
This means distributing the plan far and wide. Don’t bury it in a folder on a shared drive, and don’t attach it to an email and then forget about it. Do what it takes to ensure that everyone, including implementers and stakeholders, is familiar with the plan.
One way to accomplish this is to run regular tests.
Test Your Plan
Organizations learn by doing. While practice is never quite the same as real life, testing your DR plan regularly is the best way to ensure that everyone knows what they’re supposed to do when a crisis hits. It’s also a great way to verify that the plan will work.
Testing the plan doesn’t necessarily have to mean working weekends or taking your enterprise offline. You can test specific sections of the plan or even simulate outages instead of taking systems offline. Use Plutora’s tools to push a release that simulates a failure and then roll back to the last working release.
Just as it’s important to keep your objectives for recovery in mind, keep your objectives for testing in mind too. Some tests are for the people, and others are for the process.
Revise Your Plan
Continuous deployment and behavior-driven development are two methods that are often praised for using feedback loops to drive process improvements. Use your DR plan tests the same way. Run a postmortem on each test and use the results to refine your plans.
Your DR plan is not written in stone. Update it to reflect test results. When your infrastructure changes, update your plan. When business requirements change, such as a new target for recovery time, adjust your plan accordingly.
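One simple way to turn test results into revisions is to compare how long each recovery actually took against the recovery time you committed to. The sketch below assumes you record the minutes to recover for each system during every test. The objectives, the measurements, and the `review_drills` helper are all made up for illustration.

```python
from statistics import mean

# Hypothetical recovery time objectives, in minutes, per system.
OBJECTIVES = {"web-store": 60, "accounting": 480}

# Hypothetical measurements from past DR tests: (system, minutes to recover).
DRILLS = [("web-store", 45), ("web-store", 90), ("accounting", 300)]


def review_drills(drills, objectives):
    """Return {system: (mean_recovery_minutes, meets_objective)}."""
    by_system = {}
    for system, minutes in drills:
        by_system.setdefault(system, []).append(minutes)
    return {
        system: (mean(times), mean(times) <= objectives[system])
        for system, times in by_system.items()
    }


if __name__ == "__main__":
    for system, (avg, ok) in review_drills(DRILLS, OBJECTIVES).items():
        print(system, avg, "meets objective" if ok else "needs work")
```

Any system that misses its objective is a candidate for the next revision, whether that means better tooling, a clearer runbook, or a more realistic objective.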
Start Your Disaster Planning Today
We’ve covered what a disaster recovery plan is and how to create one.
Start with identifying the risks your business is most susceptible to, and decide how you want to react to them. The importance of setting objectives early in the process can’t be overstated. You don’t want your DR plan to be “we’ll do our best.” You want it to be enough to keep your business going during a disaster and to restore it when the crisis has passed.
Form a team of IT staff and stakeholders. Work with the team on identifying critical business resources. Form a plan that fulfills your business objectives. Finally, test the plan and refine it based on your test results. Lather, rinse, repeat.
Plutora has the tools you need to create and implement a DR plan for your IT infrastructure. Our business intelligence platform has reports that will help identify your critical resources and set objectives for protecting and restoring them.
With our deployment planning tool, you can create business continuity processes with features like optional rollbacks and manual and automated deployments. It also offers DevOps metrics for tracking mean time to repair (MTTR) and managing processes for recovery after critical events, like a disaster.
Get started on your disaster recovery plan now, before the next crisis hits!