We’ve talked about the three main concerns of the Cloud: Security, Performance and Availability. I’ve spent a lot of time talking about the security aspect, but in reality these three issues are inter-related. A recent event in the Public Cloud and, more importantly, the reaction to it, has led me to talk about availability.
A lot of people believe the Cloud is very inexpensive and infinitely scalable. To a large extent, it is all of that. However, it is not that simple.
(OK, to get it out in the open: Amazon’s Public Cloud offering, Elastic Compute Cloud (EC2), had a serious service interruption lasting nearly two days. This article is not about that, and is not going to bash Amazon. If anything, it will bash the people who got what they paid for.)
I’ve been emphasizing that from a security standpoint, not all of your data has the same requirements. Some data must be carefully secured to avoid business or legal repercussions; some data does not require much care. The same is true of availability. Some of your applications are critical to your business: if they are not accessible, you are not in business. Many are not critical. If they are not accessible for a day or two it may be annoying, but their unavailability does not threaten the survival of your business.
Years ago I worked for a company that had a number of airlines as customers. Running an airline is complicated and requires a lot of computer systems to keep everything moving. One of those systems is the operations system. Among other things, this system generates the report that the US Federal Aviation Administration (FAA) requires before you can push a passenger airplane away from the gate. I was at the airport on my way to visit one of these customers when I got a call. “Please don’t come. We had a serious fire in our operations center overnight, and it will be probably two months before we can get back in the building.”
This airline had a backup plan; they tested it often, and it worked. Everything shifted over to their backup site, and not a single airplane departure was delayed by the disaster. But they were not interested in anything new until they got the primary site back up.
At about the same time, another airline had a similar problem. In this case, one of their landscaping teams was using a backhoe and cut the fiber optics cables for both their primary and secondary communications suppliers, and for both their primary and backup facilities. Yes, they all ran through the same conduit. They had spent lots of money and effort to have a fail-safe system, but had somehow missed that one vulnerability. They had hundreds of cancelled and delayed flights.
Backup facilities are expensive, and may not be as reliable as you expect. Another customer many years ago had their data center near a major midwestern river. The river flooded one year, and they were essentially down for a couple of weeks. Vowing never to let that happen again, they built a backup facility a couple of hundred miles away … up the same river. A couple of years later, the river flooded out both centers. Vowing never to let that happen again, they built a new backup facility away from any river. It was in tornado territory, but the building was hardened against pretty much any tornado, and since power was a risk, it had its own diesel generator. The plan was that when a tornado watch was issued for either their main or backup facility, they would crank up the diesel generator at the hardened backup facility and ride it out. They tested this every month, bringing up the diesel generator and getting the backup facility fully operational. One day the tornado watch came, and they switched over to diesel power. Sure enough, they lost building power. Twenty minutes later, they lost diesel power. One minor item was missing from the monthly test: checking the fuel gauge on the generator. It was two days before they could get more fuel delivered.
The important take-away from these stories, and the hundreds of others like them, is that it is very complicated, and therefore expensive, to build a secure backup facility that will always be there. Over and above the physical issues in these stories (fire, backhoe, flood, and fuel) are all of the operational issues: Do you have the right data in the right place? Are the system and application software components all at the correct patch levels? Have all of the configuration and operational changes been made at the backup site? …
Even if you have all of that handled, there is still the issue of people. Another customer had a backup facility a few hundred miles away with everything in place. The plan was that if there were a disaster, a team would drive to the backup site and get everything up and running. In their case, they could afford the six to eight hours of downtime the trip would cost. However, they were in Southern California, and a couple of summers ago the wildfires made it impossible to drive to the backup site, and planes were not flying because of the smoke. They had a few tense hours as the fire headed right for their primary data center. Fortunately, the firefighters got ahead of the fire this time, but, had they needed to, it would have been a couple of days before they could have physically gotten their team to the backup site.
Enter the Cloud. If you look at the major Cloud Service Providers (CSPs), including Amazon, each has multiple data centers separated by hundreds or thousands of miles, usually including one or more centers on another continent. They have the network links between these centers to quickly move applications and data from one to another. More importantly, they have the processes, procedures and people necessary to keep those centers in sync in terms of a customer’s applications and data. They can make those transitions effectively, and often without any impact. However, if you want your applications to be able to run in more than one Amazon region, you have to pay for that capability. Those that did had minimal or no interruption. Those that had everything in Amazon’s Northern Virginia center, well, they noticed.
Like in a lot of things, you get what you contract, and pay, for.
Does this mean you should stop thinking about moving to the Cloud? No. But it does mean you should pay attention to your own availability requirements. If you absolutely, positively need 99.99% uptime all the time and your current environment provides that, stay out of the Cloud for now.
As always, the first step is to figure out what you really need in terms of availability. Which applications absolutely have to be up for you to stay in business? What is the impact if an application is down for a minute, an hour, a day, …? If possible, compute that impact in dollars.
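To make that exercise concrete, here is a minimal sketch of turning downtime into dollars. The revenue figure is invented for illustration; substitute your own business numbers, and note that real impact often includes harder-to-quantify costs such as lost customers and penalties.

```python
# Hypothetical figure -- replace with your own number.
REVENUE_PER_HOUR = 12_000.0  # revenue a critical app generates per hour


def downtime_cost(hours_down: float, revenue_per_hour: float) -> float:
    """Crude estimate: lost revenue is proportional to time down."""
    return hours_down * revenue_per_hour


# A minute, an hour, a day:
for hours in (1 / 60, 1.0, 24.0):
    cost = downtime_cost(hours, REVENUE_PER_HOUR)
    print(f"{hours:6.2f} hours down -> ${cost:,.0f}")
```

Even this crude model makes it obvious which applications justify paying for redundancy and which do not.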
When you talk to a CSP, have your availability requirements written down, and check the CSP’s availability SLAs (Service Level Agreements). Amazon’s EC2 SLA commits to keeping you up and running 99.95% of the time. That calculates out to about 4.4 hours of downtime a year. Sounds good. There are, however, lots of exceptions and some strange ways of determining whether your application counts as down. The penalty if Amazon doesn’t meet that goal: a credit of up to 10% off your bill for that month. I’m just using Amazon as an example – most CSPs have similar SLAs on their Public Cloud offerings. Microsoft’s Azure, for example, doesn’t count downtime unless you are completely down for five minutes: no access to any of your applications for the entire five-minute interval. They also exclude any time spent applying software patches or updates. Again, these are typical components of a CSP’s Public Cloud SLA contract.
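The arithmetic behind that 4.4-hour figure is simple enough to check yourself. This sketch converts an availability percentage into the downtime an SLA actually permits per year or per month:

```python
HOURS_PER_YEAR = 24 * 365.25  # about 8,766 hours


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours the provider can be down and still meet the SLA."""
    return period_hours * (1 - availability_pct / 100)


print(f"{allowed_downtime_hours(99.95):.1f}")           # about 4.4 hours a year
print(f"{allowed_downtime_hours(99.99):.1f}")           # under an hour a year
print(f"{allowed_downtime_hours(99.95, 24 * 30):.2f}")  # per 30-day month
```

Run the numbers for the availability your business actually needs before comparing them against what the contract promises.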
In my view, the SLAs are pretty useless other than as an indication of what the CSP believes it can almost always provide. If “almost always” is good enough for you, go for it. If not, look to do something better. Pay for the ability to run in multiple Amazon EC2 regions. Have more than one CSP running your applications so you stay up when one goes down. Or have another CSP ready to go: keep all of your software and data loaded in their environment, then turn it on when you need it. You will have to pay for the storage, but that is likely to be a few dollars a month depending on how much you have. Of course, there are operational issues to deal with, like keeping your data up to date at the backup CSP.
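The payoff from running on more than one provider can be estimated with the usual independence assumption: you are only down when everyone is down at once. Treat the result as an upper bound, since real-world failures are not always independent, as the shared-conduit backhoe story above shows.

```python
def combined_availability(*availabilities: float) -> float:
    """Probability that at least one of several independent providers is up.

    Each argument is a single provider's availability as a fraction (0..1).
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1 - a)  # all must fail simultaneously
    return 1 - p_all_down


# Two providers, each meeting a 99.95% SLA:
print(combined_availability(0.9995, 0.9995))  # roughly 0.99999975
```

Two independent 99.95% providers together look far better than either alone on paper; the hard part, as noted above, is keeping data in sync and making sure the failures really are independent.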
And don’t forget the security concerns. Can you legally have some of your data on another continent?
Some people are predicting a significant drop in Public Cloud usage because of the Amazon EC2 incident. The same thing can happen in a Private Cloud environment, or a Managed Services environment, or in your own shop. Personally, I don’t think this will have a significant impact on Cloud adoption – hopefully it will make people a little smarter.
The last word:
Was this the last significant Cloud failure? Nope. No more than we have seen the last train wreck, madman with a bomb, stock market collapse, or magnitude-8 earthquake. Which brings me to a t-shirt I recently saw at the gym: “Death is guaranteed. Everything else is earned.”
Keep your sense of humor.