A few weeks ago I talked about Data in the Cloud. Today I would like to talk about using the Cloud to enhance your disaster recovery capability. Disaster recovery can also be called “business continuance” or in the high-end government space “continuous operations” or “COOP” (pronounced “coop” like a chicken coop). No matter what term you use, it means that you can keep the critical components of your business running through at least some kinds of serious interruptions. Here we’ll just call it all “DR” for “disaster recovery.”
No matter what your business is, you rely on your IT to keep it going. Some parts of your IT are critical to staying in business. These probably include orders & billing, process control, and connections to your key partners. Other parts, like payroll and HR activities, are important but not as critical. For each system, there is some amount of time you can keep the business operating if the system is unavailable. Beyond that, you are closed, perhaps temporarily or, worst case, permanently. But a complete continuous operations system that guarantees that your critical systems are always available can be very expensive, usually prohibitively expensive except for governments.
What could go wrong? For starters, your facility can be impacted by things like
- Loss of power
- Local flood
- Building failure (e.g., collapse due to snow on the roof or the sudden insertion of an 18-wheeler or other large object)
- Loss of access (e.g., a nearby chemical leak that evacuates the entire building)
- Loss of communications connections (in one case, a company had two ISPs so that a failure at one would not knock them off the Internet, except one of their own maintenance men ran a backhoe through the conduit where both ISPs’ connections entered their building)
Plus there are larger geographic events, including
- Weather (Hurricane Katrina was 500 miles across. How far away is your DR site?)
- Regional power outage
- Regional flood
- Forest or prairie fire
There are two important objectives that need to be set, and then met, for each system.
Your recovery time objective (RTO) is the maximum time after a disaster within which business functions need to be restored. In other words, it is how long after a failure occurs you can take to be back in operation.
Your recovery point objective (RPO) is the maximum age of the data that must be recovered from a backup or other mechanism. The RPO is expressed backward in time (that is, into the past) from the instant at which the failure occurs. This is a measure of how old the data can be when you come back up.
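To make the two objectives concrete, here is a minimal sketch of how you might score one recovery against them. The system, the timestamps, and the names `Objectives` and `evaluate_recovery` are all made up for illustration; the only point is that RTO is measured forward from the failure and RPO is measured backward from it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum tolerable downtime, measured forward from the failure
    rpo: timedelta  # maximum tolerable data age, measured backward from the failure


def evaluate_recovery(failure, last_good_copy, restored, objectives):
    """Compare one recovery (real or rehearsed) against its objectives."""
    downtime = restored - failure        # forward in time from the failure
    data_age = failure - last_good_copy  # backward in time from the failure
    return {
        "downtime": downtime,
        "data_age": data_age,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_age <= objectives.rpo,
    }


# Hypothetical order-entry system: back up within 4 hours, lose at most 1 hour of data.
order_entry = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

result = evaluate_recovery(
    failure=datetime(2024, 3, 1, 2, 0),
    last_good_copy=datetime(2024, 3, 1, 0, 30),  # newest recoverable copy
    restored=datetime(2024, 3, 1, 5, 0),
    objectives=order_entry,
)
print(result["rto_met"], result["rpo_met"])  # True False
```

In this made-up case the system is back in 3 hours, inside the RTO, but the newest copy is 90 minutes old, so the RPO is missed. The two objectives really are independent, and each system needs both.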
The ideal situation is RTO and RPO both close to zero, but the cost can increase exponentially as these numbers get smaller. As a rule of thumb, there is a significant cost jump as you change the units of these objectives, from days to hours to minutes to seconds. Finding the right objectives is a balancing act between cost and benefit.
In general, the further away your DR site is from your main site, the more it costs to achieve low RTO and RPO. The technology exists today to have a DR site 5,000 km away (about 3,000 miles) with reasonable RTO and RPO values.
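If you want a rough answer to "how far away is your DR site?", a great-circle distance is a reasonable first pass. The sketch below uses the standard haversine formula and compares the result to the width of a regional event. The coordinates are approximate city centers picked purely for illustration, and comparing distance to event width is a deliberately crude test that ignores the track of a storm, shared power grids, and shared carriers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
KM_PER_MILE = 1.609344


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def outside_event_footprint(distance_km, event_width_km):
    """Crude test: is the DR site farther away than the event is wide?"""
    return distance_km > event_width_km


# Illustrative coordinates (approximate city centers), not real data centers.
main_site = (29.95, -90.07)  # New Orleans
dr_site = (29.76, -95.37)    # Houston

distance = great_circle_km(*main_site, *dr_site)
storm_width = 500 * KM_PER_MILE  # a storm 500 miles across
print(round(distance), outside_event_footprint(distance, storm_width))
```

With these two example cities the sites are closer together than the storm is wide, so a single event of that size could plausibly reach both.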
If you want a DR solution you need four things, in order of increasing difficulty:
- An infrastructure to operate on, including servers, storage and communications.
- Your own applications and operating environment, all at the appropriate release and patch level.
- Knowledgeable people to operate it.
- Your data.
Infrastructure is easy. You may even be able to share with another company by acting as each other’s DR facility.
Applications and the operating environment are fairly easy, but you will need to check your software license agreements to make sure you are not violating your vendor agreements. There is no standard here, so the terms vary from vendor to vendor. Making sure the DR site always has the correct software version and patch levels isn’t trivial.
Many organizations plan to take some of their own operations people to the DR site when necessary. At one point in Southern California a few years ago, the brush fires literally surrounded a major city. No airplanes could fly, the trains were shut down, and the only road out required a 200 mile detour on traffic-choked roads. Depending on the nature of the disruption, your people may be distracted by more pressing personal issues. You should at least think about alternatives to your own personnel.
The really hard part is getting the data there. This requires careful analysis in order to meet your RPO. The issue here is more than just periodically backing up. We had a customer who called us one morning to tell us their office had a fire and was completely destroyed overnight. Fortunately, nobody was hurt, but everything was gone. We quickly got replacement equipment and some rooms in a local hotel, and worked with the phone company to get their data communications shifted, so that by 2 PM that day we had a good enough working environment to at least keep them in business. Except for their data. Their only backup copy was in their office. The key is not only to have backups, but to have them somewhere safe and accessible, and that has to be counted when evaluating your RPO.
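One way to count location when evaluating your RPO is to ignore every backup copy stored at a site you have just lost, and measure data age from the newest copy that survives. The sketch below does that. The site names, timestamps, and function names are hypothetical, and a real inventory would also need to confirm that the surviving copy is actually readable.

```python
from datetime import datetime, timedelta


def newest_surviving_copy(copies, lost_sites):
    """Return the most recent backup copy not stored at a lost site, or None."""
    surviving = [c for c in copies if c["site"] not in lost_sites]
    return max(surviving, key=lambda c: c["taken_at"], default=None)


def rpo_met(copies, lost_sites, failure, rpo):
    """True only if some surviving copy is no older than the RPO at failure time."""
    copy = newest_surviving_copy(copies, lost_sites)
    if copy is None:
        return False  # no surviving copy at all: the data is simply gone
    return failure - copy["taken_at"] <= rpo


failure = datetime(2024, 3, 1, 2, 0)
copies = [
    {"site": "office", "taken_at": failure - timedelta(hours=1)},
    {"site": "offsite-vault", "taken_at": failure - timedelta(hours=26)},
]
rpo = timedelta(hours=24)

# Nothing lost: the one-hour-old office copy counts.
print(rpo_met(copies, lost_sites=set(), failure=failure, rpo=rpo))  # True
# Office destroyed: only the 26-hour-old offsite copy survives.
print(rpo_met(copies, lost_sites={"office"}, failure=failure, rpo=rpo))  # False
# Only copy was in the office, as in the story above.
print(rpo_met(copies[:1], lost_sites={"office"}, failure=failure, rpo=rpo))  # False
```

The backup schedule looks fine until you remove the office from the picture, which is exactly the case a DR plan exists to handle.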
So how does the Cloud help?
Most Cloud Service Providers have solved many of these problems and provide some level of built-in DR for applications running in their own Cloud environment including geographically separated sites. Some specialize in DRaaS, Disaster Recovery as a Service. Almost by definition, all Cloud vendors can get your specific infrastructure up very quickly, sometimes in seconds. Most also know how to keep your software stack, your applications and operating environment, up to the appropriate levels. Through their automation and load balancing tools, they can usually operate your software environment efficiently without any input from you. By the very nature of the Cloud, they are designed to interface to clients easily. Usually, switching between those sites is a single configuration change, often done automatically by the managing software.
Since the Cloud Service Provider is serving multiple customers, they can provide a more robust DR environment and spread the costs among all of their subscribers. Each subscriber pays a reduced cost for a level of DR capability beyond what they could afford to build themselves. If you are already using the Cloud as an overflow production facility when your existing IT infrastructure can’t support the workload, then you pretty much have a DR capability. You need to deal primarily with switching communications links, but you probably have 95% of what you need for a complete DR capability.
One interesting side effect is that once you use the Cloud for DR, you have solved most of the problems of using the Cloud for your production.
The main issues of using the Cloud as a DR solution are the same as for moving anything into the Cloud: security, performance, and availability. You can probably afford reduced performance and lower availability goals until you get your own facility operational again. Security remains the most difficult aspect, and the responsibility for any data loss is yours, not your Cloud vendor’s.
The last word:
Don’t forget to test, carefully. You must test your DR plan initially, every time you make a significant change to your IT environment or processes, every time you add a critical partner, and periodically just so you know it still works. Always test the worst-case scenario: you have no access to your current facility or to any of your own personnel.
Of course, the best plans have opportunities. A company had a hardened data center facility with 10 minutes of battery power and its own diesel generator. Once a month, the ops manager would start the generator, run the whole center off the generator for 20 minutes, then switch back to utility power. One day they really lost power, the generator automatically started and everything worked. The batteries had picked up the slack until the generator ran – no interruption in power. For 20 minutes. Then the generator stopped. The new ops manager added “check diesel fuel level” to the monthly checklist.
Keep your sense of humor.