I am still surprised by the large gap between reality and many executives’ perception when it comes to the availability of their IT infrastructure, especially as it relates to how long they will be down when something goes wrong. I wrote about this almost four years ago, and unfortunately not much has changed.
No matter what your business is, you rely on your IT to keep it going. Some parts of your IT are critical to staying in business; this probably includes orders & billing, process control, and connections to your key partners. Other parts are important but not as critical, such as payroll and HR activities. For each application, there is some amount of time you can keep the business operating while the system is unavailable. Beyond that, you are closed – perhaps temporarily, or in the worst case permanently. But a complete continuous-operations environment that guarantees your critical applications are always available is usually prohibitively expensive for all but governments and very large financial companies.
What can go wrong? Failures generally fall into three categories, which I’ll call annoying, serious, and catastrophic.
Annoying failures are things like:
- Updating the OS or other software on servers or workstations
- Moving virtual instances among your physical servers
- An OS or hardware failure in a server
- The failure of a storage unit that is protected by some level of RAID, so no actual data is lost
In some cases these events are predictable, but in general, if you are following reasonable best practices, the interruptions are local to one application and recovery is usually automatic. In some cases there is no impact to customers and employees; in most cases the interruption is measured in single-digit minutes. Most businesses are not materially impacted by these failures.
Serious failures generally cause failure of multiple applications that will require some level of manual effort to remediate. These failures include:
- Loss of power, including even a short power loss while your emergency backup power system starts up.
- Loss of database integrity due to a hardware or software failure. This type of failure requires that the database be recovered, possibly from archive and audit files.
Short-term loss of access to your data center may also lead to a serious failure if your normal datacenter operations require someone to periodically do something in the building, as when there are events such as:
- Minor fire
- Local flood or other external event that prevents access but does not impact your building
- Police activity
- Serious failure of your communications connections
If you share your building with other tenants, then serious failures in their area may also become your serious failure.
Catastrophic events mean that you have lost access to your facility for an extended period, measured in days or weeks:
- Major fire, flood that inundates your building, severe weather, or other regional natural event
- Building failure (e.g., collapse due to snow on the roof or the sudden insertion of an 18-wheeler or other large object)
There are two important objectives that need to be set by management, and then met to some degree by IT, for each critical application.
Your recovery time objective (RTO) is the time period after a disaster in which business functions need to be restored. This is the length of time after a failure occurs that you need to be back in operation.
Your recovery point objective (RPO) is the age of files that must be recovered from a backup or other mechanism. The RPO is expressed backward in time (that is, into the past) from the instant at which the failure occurs. This is a measure of how old the data can be when you come back up.
RPO is the harder one to understand, but let’s take a simple example. If your RPO is one day, then a daily backup stored off-site is probably sufficient. If a serious failure occurs, you use the previous day’s backup to restore the databases and other files, then simply reapply all of the transactions that occurred since that backup was taken. If your RPO is less than a day, say four hours, then a daily backup doesn’t help at all. You need to continuously keep an updated copy of your databases and other critical files so that the copy is never more than four hours behind real time.
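To make that arithmetic concrete, here is a minimal sketch of the recovery-point calculation – the age of the newest usable backup at the moment of failure – checked against two candidate RPOs. The timestamps and schedule are hypothetical:

```python
from datetime import datetime, timedelta

def recovery_point_age(backup_times, failure_time):
    """Age of the newest backup taken before the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup predates the failure")
    return failure_time - max(usable)

# Hypothetical scenario: daily backups at 02:00, failure at 18:30.
backups = [datetime(2024, 5, d, 2, 0) for d in (1, 2, 3)]
failure = datetime(2024, 5, 3, 18, 30)

age = recovery_point_age(backups, failure)
print(age)                            # 16:30:00 – the data is 16.5 hours old
print(age <= timedelta(hours=24))     # True: a one-day RPO is met
print(age <= timedelta(hours=4))      # False: a four-hour RPO is not
```

The same calculation applies to continuous replication: the "backup time" becomes the timestamp of the last replicated transaction, and the replication lag must stay under the RPO.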
The ideal situation is RTO and RPO both close to zero, but the cost can increase exponentially as these numbers get smaller. As a rule of thumb, there is a significant cost jump as you change the units of these objectives, from days to hours to minutes to seconds. Finding the right objectives is a balancing act between cost and benefit.
In general, the farther your disaster recovery (DR) site is from your main site, the more it costs to achieve low RTO and RPO. The technology exists today to have a DR site 5,000 km away (about 3,000 miles) with reasonable RTO and RPO values, but probably unreasonable cost. Hurricane Katrina was 500 miles across, and Sandy was 1,000 miles across. How far away is your DR site? If you don’t have a DR site, then you can’t deal with a catastrophic failure.
The first step is for the business to determine the appropriate RTO and RPO values for each application. This must be done realistically: simply saying “zero” for everything does not help. I suggest starting with RTO. Determine what it costs the business for an application to be down for a minute, an hour, a day, then narrow down to a realistic value. For most businesses, this is somewhere between two hours and one day for those few critical applications. RPO is usually based on the number of transactions that arrive in a minute or an hour, and a business decision on how many transactions can afford to be delayed or to require manual effort to recover. That recovery process normally starts after the system is back up and is thus additive to RTO. RPO is often influenced by how hard it is to recover those lost transactions. If a partner or some off-site device maintains an electronic record, then the recovery may be automatic and fast. If it requires manual effort to recover each transaction, then RPO may need to be fairly small.
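One way to ground the RTO discussion is a simple downtime-cost worksheet that compares what an outage costs at each candidate RTO against the rough cost of the DR tier needed to meet it. All figures below are hypothetical placeholders; substitute your own revenue and DR-cost estimates:

```python
# Hypothetical worksheet: cost of one outage at each candidate RTO
# versus the assumed annual cost of the DR tier that can meet it.
revenue_per_hour = 25_000          # assumed revenue the application supports

candidates = {                     # candidate RTO -> (hours, assumed DR cost/yr)
    "1 day":   (24, 20_000),       # e.g. off-site backup and restore
    "4 hours": (4, 120_000),       # e.g. warm standby
    "1 hour":  (1, 500_000),       # e.g. hot site with replication
}

for label, (hours, dr_cost) in candidates.items():
    downtime_cost = hours * revenue_per_hour
    print(f"RTO {label:>7}: one outage costs ~${downtime_cost:,}, "
          f"DR tier ~${dr_cost:,}/year")
```

The crossover point – where the expected cost of outages exceeds the cost of the next DR tier – is where the business decision gets made, and it depends on how often you expect each class of failure.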
RTO and RPO are, by definition, objectives; they are not mandates. That is where the problem comes in. Management will say that, for example, RTO is two hours and RPO is one hour, applaud themselves, and forget about it. IT, constrained by budgets and personnel, may not be able to implement systems that meet these objectives. Usually, the objectives can be met fairly easily for annoying failures, and with suitable automation for many serious failures. But unless your IT staff has the budget to support a functioning DR site, they are not meeting your objectives for a catastrophic event.
Talk to your IT staff, not just the CIO, and find out what the reality is. Note that there will be a different reality for annoying, serious, and catastrophic failures. Don’t change the objectives: those are driven by business needs. Find out what it will cost to move to the next level in meeting your objectives, and make a business decision. Review your objectives and the reality once a year – changes in your business may require changing your objectives, and changes in technology may provide opportunities to bring reality closer to those objectives.
Don’t forget to test, carefully. You must test your recovery procedures initially, every time you make a significant change to your IT environment or processes or add a critical partner, and periodically just so you know they still work. Always test what happens with each type of failure: annoying, serious, and catastrophic.
The last word:
The Cloud can help!
Most Cloud Service Providers have solved many of these problems and provide some level of built-in DR for applications running in their own Cloud environment, including geographically separated sites. Some specialize in DRaaS (Disaster Recovery as a Service). Almost by definition, all Cloud vendors can get your specific infrastructure up very quickly, sometimes in seconds. Most also know how to keep your software stack – your applications and operating environment – up to the appropriate levels. Through their automation and load-balancing tools, they can usually operate your software environment efficiently without any input from you. By the very nature of the Cloud, they are designed to interface to clients easily. Usually, switching between those sites is a single configuration change, often done automatically by the managing software.
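That “single configuration change” boils down to a health check and a switch. Here is a minimal sketch of the decision; the endpoint URLs are hypothetical, and in practice a managed DNS or load-balancer service runs this check and flips the record for you:

```python
import urllib.request

# Hypothetical endpoints – substitute your real primary and DR sites.
PRIMARY = "https://app.primary.example.com/health"
STANDBY = "https://app.standby.example.com/health"

def is_healthy(url, timeout=3):
    """Treat any HTTP 200 answer within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_endpoint(healthy=is_healthy):
    """Serve from the primary site; fail over when it stops answering."""
    return PRIMARY if healthy(PRIMARY) else STANDBY
```

A real deployment would add hysteresis (several consecutive failures before switching) so that one dropped packet does not trigger a failover.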
Since the Cloud Service Provider serves multiple customers, it can provide a more robust DR environment and spread the costs among all of its subscribers, giving each one a high level of DR capability – beyond what any individual subscriber could afford to build – at a reduced cost. If you are already using the Cloud as an overflow production facility when your existing IT infrastructure can’t support the workload, then you pretty much have a DR capability already. You need to deal primarily with switching communications links, but you probably have 95% of what you need for a complete DR capability.
One interesting side effect is that once you use the Cloud for DR, you have solved most of the problems to use the Cloud for your production.
The main issues of using the Cloud as a DR solution are the same as for moving anything into the Cloud: security, performance, and availability. You can probably afford reduced performance and lower availability goals until you get your own facility operational again. Security remains the most difficult aspect, and the responsibility for any data loss is yours, not your Cloud vendor’s.
Keep your sense of humor.