
Views from the Lab

Thriving in the predictably unpredictable



Cloud’s on-demand value proposition is attractive, but concerns remain about its reliability. The service levels offered by most providers are basic: prominent providers promise 99.95% availability. Service response also varies.

Our recent measurements of a widely known Infrastructure-as-a-Service (IaaS) provider find that the time-to-start an instance, that is, the time from requesting a new virtual machine instance until it is ready to use, varies from a couple of minutes for a Linux machine to between 10 and 20 minutes for a Windows machine. I/O performance varies by a factor of 6, and network performance by a factor of 10. Cloud can be predictably unpredictable.

But this unpredictability may be unacceptable, especially for those applications with stringent service level agreements (SLAs). What is a cloud consumer to do? We can push IaaS providers to guarantee the needed SLA outright, which may come at a steep price if it is even possible. Or, we can architect our application to achieve the needed SLA at the application level while using the commodity cloud components at the infrastructure level.

Let’s borrow a page from the playbook of content delivery networks (CDNs), such as Akamai or Limelight Networks. These CDNs monitor network conditions of various service providers. The CDNs then meet custom transport guarantees by determining which Internet links to use and how much to use them. The usage assignment is not static, but adapts to real-time measurements of the underlying network links. To meet custom SLAs via commodity cloud, we can extend the same overlay concept to enterprise applications.

First, we need to know what the cloud conditions are and how they vary over time. To find the state-of-the-cloud, simply use the cloud: start a few virtual machine instances, ping the network, run some workloads. One such approach hosts the same webpage workload across many cloud providers and then monitors availability and response across geographies (important because the wide-area network is sometimes the bottleneck). This information determines the best provider to use, which will change based on where the users are and when we want to use the service.
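As a rough illustration of how such measurements might be turned into a provider choice, here is a minimal sketch. The provider names, the sample format (a latency in milliseconds, or None for a failed probe) and the availability floor are all assumptions made for the example, not details of any real monitoring service.

```python
from statistics import median

def summarize(samples):
    """Return (availability, median_latency_ms) for one provider's probes.

    Each sample is a latency in milliseconds, or None if the probe failed.
    """
    successes = [s for s in samples if s is not None]
    availability = len(successes) / len(samples) if samples else 0.0
    latency = median(successes) if successes else float("inf")
    return availability, latency

def best_provider(probes, min_availability=0.99):
    """Pick the provider with the lowest median latency among those that
    meet the availability floor. Returns None if no provider qualifies."""
    candidates = []
    for name, samples in probes.items():
        availability, latency = summarize(samples)
        if availability >= min_availability:
            candidates.append((latency, name))
    return min(candidates)[1] if candidates else None

# Hypothetical probe results gathered from one user region.
probes = {
    "provider_a": [120, 130, None, 125],   # faster, but one failed probe
    "provider_b": [200, 210, 190, 205],    # slower, no failures
}
choice = best_provider(probes, min_availability=0.9)
```

Running the same selection per region and per time window would reflect the point that the best provider changes with where the users are and when they use the service.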

We can then adjust enterprise applications according to our measurements. Operations like on-demand provisioning are essential design elements that can be used to combat the unreliability and instability of the lower-level cloud. For example, we can add scaling rules like "if system utilisation hits 80%, then start one more new instance". Or we can add a detect-restart rule that monitors running nodes and automatically restarts them when a failure is detected.
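The two rules can be sketched as small decision functions. This is only an illustration: the function names, the node names and the shape of the health-check data are hypothetical, and a real deployment would call a provider's provisioning API where this sketch simply returns a decision.

```python
def scale_out_count(utilisation, threshold=0.80, step=1):
    """Scaling rule: start `step` new instances once utilisation reaches the threshold."""
    return step if utilisation >= threshold else 0

def nodes_to_restart(health):
    """Detect-restart rule: return the names of nodes whose health check failed."""
    return sorted(name for name, healthy in health.items() if not healthy)

# Hypothetical monitoring snapshot.
new_instances = scale_out_count(0.85)
failed = nodes_to_restart({"web-1": True, "web-2": False, "db-1": False})
```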

By themselves, these rules do not relate to service level guarantees. But taken together with measurements of the underlying IaaS, the thresholds and settings of the rules do relate to application-level SLAs.

For example, to meet demand with high probability, instead of always triggering scaling at 60%, we trigger at 50% when the underlying IaaS reliability degrades. Or instead of starting one new instance, we start three when the time-to-start an instance gets longer. We need to adapt the trigger and the response to the actual measurements of the state-of-the-cloud. In this scaling case, we adapt based on the time-to-start an instance, the reliability of the capacity of each instance, and the demand.
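One way to picture this adaptation is the sketch below. The sizing formula (cover the demand that arrives while new instances are booting) and every parameter name are assumptions chosen for illustration; the 60% and 50% triggers are the ones from the example above.

```python
import math

def adapted_threshold(base, degraded, reliability, reliability_floor):
    """Use the lower (earlier) trigger when measured reliability falls below the floor."""
    return degraded if reliability < reliability_floor else base

def instances_to_start(demand_growth_per_min, time_to_start_min, capacity_per_instance):
    """Start enough instances to absorb the demand that arrives while they boot."""
    extra_demand = demand_growth_per_min * time_to_start_min
    return max(1, math.ceil(extra_demand / capacity_per_instance))

# Hypothetical measurements: reliability has dipped and boot time has grown.
trigger = adapted_threshold(0.60, 0.50, reliability=0.999, reliability_floor=0.9995)
quick_boot = instances_to_start(15, 2, 100)    # 2-minute start: one instance suffices
slow_boot = instances_to_start(15, 20, 100)    # 20-minute start: three are needed
```

With these made-up numbers, the longer time-to-start alone moves the response from one instance to three, which is the kind of adjustment described above.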

Let’s take a look at a specific scenario. We need a service with 99.999% availability but the underlying infrastructure only provides 99.95% availability. What can we do? Here are our options:

  1. High Availability: Using two independent 99.95% commodity services will satisfy 99.999% (to be precise, the availability is 1 - (1-0.9995)^2, or about 99.999975%).
  2. Detect-Restart: Consider the detect-restart rule that restarts a machine when it fails. Taken in conjunction with estimates of the mean-time-before-failure (MTBF) and the mean-time-to-recover (MTTR) of an instance, we can map the availability (MTBF/(MTBF + MTTR)) to the SLA. Using the scenario above, this scheme achieves 99.95% for Windows and 99.99% for Linux.
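Both options reduce to short formulas, sketched here in Python. The MTBF of 20,000 minutes and the recovery times of 2 and 10 minutes are assumed values, picked only because they land close to the figures quoted above; they are not measurements from this article.

```python
def parallel_availability(availability, replicas):
    """Availability of `replicas` independent services where any one suffices."""
    return 1 - (1 - availability) ** replicas

def restart_availability(mtbf, mttr):
    """Steady-state availability of one instance under detect-restart.

    mtbf and mttr must be in the same time unit.
    """
    return mtbf / (mtbf + mttr)

# Option 1: two independent 99.95% services.
high_availability = parallel_availability(0.9995, 2)

# Option 2: detect-restart with an assumed MTBF of 20,000 minutes.
linux = restart_availability(20_000, 2)      # assumed 2-minute recovery
windows = restart_availability(20_000, 10)   # assumed 10-minute recovery
```

Under these assumptions the Linux case comes out near 99.99% and the Windows case near 99.95%, while the two-service option exceeds 99.999%.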

In other words, we can meet 99.95%, 99.99% and 99.999% under various schemes based on different underlying provider conditions. Using the measured conditions, we compare the high availability option with one that results in slightly less availability but also reduces the provisioned resources by half. Are the extra 9s worth it?
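To make that trade-off concrete, it can help to convert each availability level into expected downtime per year. The sketch below does only that arithmetic; it assumes a 365-day year and says nothing about cost, which would have to come from the provider's pricing.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

levels = [0.9995, 0.9999, 0.99999]
downtime = {level: downtime_minutes_per_year(level) for level in levels}
```

That works out to roughly 263 minutes a year at 99.95%, 53 minutes at 99.99% and 5 minutes at 99.999%.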

Cloud can be predictably unpredictable, but we can meet high availability goals if we architect applications in the right way. Armed with state-of-the-cloud information, we can map our operations rules to meet custom SLAs. We adapt the parameters of our rules as the state-of-the-cloud varies to maintain fixed SLAs. In this case, the best defence is a good offence: If our virtual machine goes down, we turn it back on. If the capacity drops, we get another one. If our current provider doesn’t have one or our provider goes down, we use another provider. As the cloud consumer, we monitor conditions to determine what to do and when to do it.

Why is there this service variation? It has something to do with shared resources being consumed by disparate workloads. But really, who cares? The question of "why" is better left for the cloud provider, who can actually do something about it. For the cloud consumer, the question should instead be "what are you going to do" to thrive in the predictably unpredictable.


Posted by Teresa Tung, Ph.D

