The Cloud is not a product; it is a concept. Unlike, say, laptops, Cloud offerings differ vastly from one another. It is critical to determine exactly what kind of Cloud implementation you need, and that will likely vary from one application to another. But even knowing that you need, for example, a Public Infrastructure as a Service Cloud is not sufficient. Different Cloud Service Providers (CSPs) offer different flavors of the same type of Cloud. Sometimes this is deliberate, as they try to satisfy a specific market; sometimes it is simply the way they do it. These differences can be significant. In a survey by Symantec and the Ponemon Institute, 75% of respondents said that the migration to Cloud Computing in their organization was occurring in a less-than-ideal manner.
Why? I suspect it is because companies did not pay enough attention to the details. Moving to the Cloud is a business imperative – companies that do not take advantage of the financial and agility benefits of the Cloud will be left behind. But the Cloud is a new paradigm for IT. Like all of the other disruptions in the data processing world over the past 50 years, it requires that IT management think differently. It also requires that IT really understand the requirements. In many of the organizations I have worked with, these requirements are not really understood and certainly not documented. Therefore, the hardest part of moving to the Cloud is often determining what your requirements really are. Different workloads will have different requirements.
A while ago I posted a blog entry about Cloud Requirements that discussed the key requirement areas that influence Cloud implementation and CSP decisions.
Almost exactly one year ago I posted “Availability in the Cloud,” which covered some of the strange things in the availability Service Level Agreements (SLAs) found in CSP contracts. Not much has changed. As with a lot of things, you get what you contract for, and pay for.
This time I will concentrate on how to determine your availability requirements. In some cases, this may be easy. You may have included availability SLAs in contracts with your customers or partners. If so, those SLAs will provide a good starting place. One caution: check with your contracts people to make sure you don’t have different SLAs for different customers.
Even if you have documented availability SLAs, I suggest you also talk with your heads of marketing and sales. They may have a different view, or no view at all. It may be something they have not even thought about.
If you have a Customer Advisory Board, poll a few members to see what their view is. Are they happy with the published SLAs? Are they happy with your availability history? Would they be willing to pay for a higher level of availability?
Then, just for grins, ask your CEO, CFO, CTO and CIO what they believe your availability requirements are. If you get the same answer from all four (other than “I don’t know”) you are in an amazing position: you work for an organization that has a consistent senior management view of a critical attribute of your business, one that many executives overlook entirely.
At this point you should be able to create a set of availability requirements. You will probably notice that different applications have vastly different availability requirements. Use this as one of the means of dividing your total IT environment into separate groups, each group having similar availability requirements. Consider each group separately as you move to the Cloud.
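The grouping step can be sketched in a few lines of Python. This is only an illustration: the application names, the recovery-time figures, and the three tier boundaries are all assumptions made up for the example, and your own groups should come from the requirements you gathered.

```python
from collections import defaultdict

# Hypothetical inventory: application name -> maximum tolerable recovery time (hours).
# Names and numbers are illustrative assumptions, not data from a real organization.
APP_MAX_RECOVERY_HOURS = {
    "order-entry": 0.5,
    "customer-portal": 1,
    "billing": 8,
    "reporting": 48,
    "archive-search": 72,
}

# Assumed tier boundaries as (upper bound in hours, label), checked in order.
TIERS = [(1, "critical"), (8, "important"), (float("inf"), "deferrable")]


def tier_for(max_recovery_hours):
    """Return the first tier whose upper bound covers the requirement."""
    for upper_bound, label in TIERS:
        if max_recovery_hours <= upper_bound:
            return label
    raise ValueError("no tier matched")  # unreachable while the last bound is infinite


def group_by_tier(apps):
    """Group application names by tier, with names sorted within each group."""
    groups = defaultdict(list)
    for name, hours in sorted(apps.items()):
        groups[tier_for(hours)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for label, names in group_by_tier(APP_MAX_RECOVERY_HOURS).items():
        print(f"{label}: {', '.join(names)}")
```

Each resulting group can then be taken to the Cloud discussion on its own, with its own availability target.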
There are three levels of events that impact availability.
- Local single failures.
These events include the failure of a server, a disk drive, or a network component. These are usually quickly recovered. In most cases, the recovery time for this kind of failure is measured in minutes. Depending on exactly what fails and what it was working on at the time of the failure, the event could force a database recovery, which can lengthen the recovery time.
- Complete database failure.
These are events that force an entire rebuild and recovery of one or more databases. These are caused more often by operational or software failures than by a “simple” hardware failure. Recovery may require the complete reloading of a database from the most recent backup, and re-applying all of the transactions that were processed since that backup. In most cases, the recovery time for this kind of failure is measured in hours.
- Building failure.
These are events that make it impossible to enter your building for a period of time measured at least in days. This could be fire, weather, earthquake, government action due to civil unrest or terrorism, or a seemingly unrelated event that you are too close to (e.g., Fukushima). Without any significant preplanning and preparation, the recovery time for this kind of failure is measured in days or weeks.
The third category is normally referred to as Disaster Recovery. It is often considered separately because the cost of achieving a recovery time measured in hours is at least an order of magnitude higher than achieving a similar recovery time for the first two cases. Unless your organization has an implemented disaster recovery plan, everything you have learned so far probably covers only the first two cases.
The next step is to determine what you can actually achieve today. Go to your IT operations leaders and ask them how long it would take to recover from each of the three cases. Show them what you have learned about management’s expectations. Be prepared to be astonished by reality. On more than one occasion I have seen management’s expectation for the first two cases be two hours, while the IT operational reality for the second case was more like two days. Often there is nobody on the operational staff who was there the last time they had a complete database failure, or the company is so new that one has not yet happened, or at least one has not happened since the company became absolutely dependent on customer and partner communication over the Internet. They have not been burned, so they don’t think about it. But, like a cyber attack, the question is not “if” one will occur, but “when.”
At this point you have a document that, probably for the first time, explains your company’s availability position. For each group of applications with similar availability requirements it details those requirements as specified by management or business needs, along with what is actually achievable today within each failure case.
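One possible shape for that document, sketched in Python: one record per application group and failure case, holding the expected and the achievable recovery time, plus a helper that lists the shortfalls. The `Position` class, the field names, and every figure below are hypothetical; the two-hour expectation against a two-day reality simply mirrors the kind of mismatch described above.

```python
from dataclasses import dataclass

# The three failure levels discussed earlier.
FAILURE_CASES = (
    "local single failure",
    "complete database failure",
    "building failure",
)


@dataclass(frozen=True)
class Position:
    group: str
    failure_case: str
    expected_hours: float    # what management or the business expects
    achievable_hours: float  # what IT operations says it can do today

    @property
    def gap_hours(self):
        """Hours by which reality misses the expectation (0 if it is met)."""
        return max(0.0, self.achievable_hours - self.expected_hours)


def shortfalls(positions):
    """Return the positions where reality misses the expectation, worst first."""
    missed = [p for p in positions if p.gap_hours > 0]
    return sorted(missed, key=lambda p: p.gap_hours, reverse=True)


# Illustrative figures only (assumptions, not measurements).
POSITIONS = [
    Position("critical", FAILURE_CASES[0], expected_hours=2, achievable_hours=0.5),
    Position("critical", FAILURE_CASES[1], expected_hours=2, achievable_hours=48),
    Position("critical", FAILURE_CASES[2], expected_hours=24, achievable_hours=336),
]

if __name__ == "__main__":
    for p in shortfalls(POSITIONS):
        print(f"{p.group} / {p.failure_case}: expected {p.expected_hours}h, "
              f"achievable {p.achievable_hours}h, gap {p.gap_hours}h")
```

Sorting by the size of the gap puts the most uncomfortable conversations at the top of the list, which is usually where the stakeholder discussion needs to start.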
Now it is time to get all the stakeholders in a virtual room and go over the current situation: what they expect, and what is reality. Determine what the real availability requirements should be. Determine if you really need a disaster recovery plan, and for what specific applications. Get a project plan from IT to make any application, database, or operational changes necessary to get reality in line with these expectations.
Now you are ready to talk to the CSPs. You have the requirements. Get what you need for each group of applications, but no more.
The last word:
Currently there is little uniformity in the contracts or even the terminology among CSPs. It can be difficult to compare availability SLAs across CSPs. The best way is to sit down and figure out exactly what it means to be “down.” Determine what is excluded from counting as down time. For most CSPs it is anything that is “planned.” Many will make a “best effort” to notify you of pending planned outages. Does that make it any better?
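To see why the definition of “down” matters, it helps to do the arithmetic. The sketch below converts an SLA percentage into allowed downtime and then computes measured availability with and without planned outages counted. The function names, the 30-day measurement period, and the outage figures are assumptions for illustration; an actual contract will define its own period and exclusions.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime an SLA percentage permits over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


def measured_availability(unplanned_minutes, planned_minutes,
                          period_days=30, exclude_planned=True):
    """Availability percentage, optionally ignoring planned outages
    the way many contracts do."""
    period_minutes = period_days * 24 * 60
    counted = unplanned_minutes
    if not exclude_planned:
        counted += planned_minutes
    return 100 * (1 - counted / period_minutes)


if __name__ == "__main__":
    # Illustrative month: 30 minutes unplanned, 240 minutes planned (assumed figures).
    print(f"99.9% allows {allowed_downtime_minutes(99.9):.1f} minutes per 30 days")
    print(f"Contract view: {measured_availability(30, 240):.2f}%")
    print(f"Your view:     {measured_availability(30, 240, exclude_planned=False):.2f}%")
```

With these assumed numbers, a 99.9% SLA allows roughly 43 minutes of downtime in a 30-day period. The CSP would report about 99.93% and meet the SLA, while your users actually experienced about 99.4%.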
Most importantly, make sure that anything that was committed to you in verbal or written conversations during the negotiations is part of the contract. If you provided an RFP (Request for Proposal) or any documented set of requirements that the CSP responded to, make sure your requirements and the CSP’s response are included in the contract.
Keep your sense of humor.