Views from the Lab

Testing for cloud - Unleashing a barrel of monkeys



Cloud providers strive to offer high reliability for their services, but part of this promise assumes that tenants use the services correctly. Large-scale cloud failures do happen, and services can be interrupted. Tenants that do not architect for these cases can be severely affected during such outages.

Cloud tenants cannot rely on traditional data centre-based availability solutions: tenants typically have little to no direct visibility into or control over the underlying infrastructure resources, so resolving failures in the underlying compute environment is left to the cloud provider. Cloud-based applications must therefore achieve availability by other means.

Tenants augment reliability with high availability (HA) mechanisms that provision additional resources and redirect traffic to them when existing resources in the compute environment are compromised. These HA mechanisms often rely on the tenant’s correct implementation of mechanisms offered by the cloud provider (e.g., auto-scaling).

Consider the case study published by Netflix. As a cloud tenant, Netflix has built and deployed its core business services (e.g. streaming movies, recommendation) on Amazon Web Services (AWS) since 2010. Bearing in mind that it’s ultimately the cloud tenants’ responsibility to architect resiliency into their applications to operate through and recover from failures, Netflix developed Chaos Monkey, a service that randomly turns off virtual machines (VMs) to proactively mimic environmental failures in order to test Netflix’s recovery mechanisms.  

In this way, Netflix can detect and fix flaws in the implementation of its recovery mechanisms for as many failure scenarios as possible. The goal is to learn from the failures so that Netflix won’t fail the same way twice.

Chaos Monkey targets VMs running in AWS auto scaling groups. By default, an auto scaling group should automatically detect the termination of an instance and replace it with a new, identically configured instance.
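To make the terminate-and-replace loop concrete, here is a minimal sketch in Python. It is a toy simulation, not Netflix’s Chaos Monkey and not the AWS API: the class `ToyAutoScalingGroup` and the functions `chaos_monkey` and `run_experiment` are hypothetical names invented for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyAutoScalingGroup:
    """Hypothetical stand-in for an auto scaling group (not the AWS API)."""
    desired_capacity: int
    instances: List[str] = field(default_factory=list)
    launched_total: int = 0

    def reconcile(self) -> List[str]:
        """Launch replacement instances until desired capacity is restored."""
        launched = []
        while len(self.instances) < self.desired_capacity:
            self.launched_total += 1
            instance_id = f"vm-{self.launched_total}"
            self.instances.append(instance_id)
            launched.append(instance_id)
        return launched

    def terminate(self, instance_id: str) -> None:
        """Remove an instance, as if the VM had been turned off."""
        self.instances.remove(instance_id)


def chaos_monkey(group: ToyAutoScalingGroup, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim


def run_experiment(rounds: int = 5, capacity: int = 3,
                   seed: int = 0) -> Tuple[ToyAutoScalingGroup, List[str]]:
    """Kill one VM per round and check that the group recovers each time."""
    rng = random.Random(seed)
    group = ToyAutoScalingGroup(desired_capacity=capacity)
    group.reconcile()  # initial launch
    killed = []
    for _ in range(rounds):
        killed.append(chaos_monkey(group, rng))
        group.reconcile()
        # The recovery mechanism under test: capacity must be restored.
        assert len(group.instances) == capacity
    return group, killed
```

In a real deployment the assertion would be replaced by a check against the live service, since the point of the exercise is to discover whether recovery actually happens rather than to assume it does.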

In addition to Chaos Monkey, Netflix created a Simian Army to verify HA mechanisms in response to other types of environment-based failures, such as an outage of an entire availability zone, or service degradation simulated by introducing artificial delays into REST services. The point is to proactively simulate disruption to test the implementation of recovery mechanisms.
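The idea of adding artificial delays to a service call can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Simian Army’s implementation: `with_injected_delay`, `call_with_fallback` and `FakeClock` are hypothetical helpers, and the latency check here is made after the call returns rather than by enforcing a true timeout.

```python
import time


class FakeClock:
    """Hypothetical clock so the sketch can run without real waiting."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds

    def monotonic(self):
        return self.now


def with_injected_delay(func, delay_seconds, sleep=time.sleep):
    """Wrap a service call so every invocation is delayed first."""
    def wrapper(*args, **kwargs):
        sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(service, budget_seconds, fallback,
                       clock=time.monotonic):
    """Call the service; if it exceeded its latency budget, use the fallback.

    Simplification: the budget is checked after the call completes.
    """
    start = clock()
    result = service()
    if clock() - start > budget_seconds:
        return fallback
    return result


# Usage with the fake clock: a 2 s injected delay against a 0.5 s budget.
clock = FakeClock()
slow_service = with_injected_delay(lambda: "personalised list", 2.0,
                                   sleep=clock.sleep)
degraded = call_with_fallback(slow_service, 0.5, "default list",
                              clock=clock.monotonic)
```

Here `degraded` ends up as the fallback value, which is the behaviour a degraded-but-available service should show when a dependency slows down.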

Chaos Monkey has helped Netflix improve its resiliency against cloud outages. Netflix services ran without interruption or intervention (albeit with higher latency and a higher-than-usual error rate) through an outage on 21 April.

However, an approach like Netflix’s is just the first step, and does not cover many scenarios.  Environment-based failures go beyond those based on turning off VMs or adding delays.  Other failures to consider include the following: 

  1. Network failures may cause a set of VMs to be unreachable.
     
  2. Problems can be caused by overloaded VMs rather than completely dead VMs. 

  3. The purely random approach of turning off VMs does not guarantee that all VMs, or even the critical VMs, are tested. 

  4. Outages can still be caused by security vulnerabilities of the systems. 

  5. VM granularity is not enough: it is neither coarse enough to verify data centre or regional failures, nor fine enough to verify degradation or failure of the individual services running on a VM.
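Point 3 in the list above can be quantified with a short calculation. The sketch below assumes a simplified model that is not taken from the case study: the fleet holds a fixed number of VM slots, and each round terminates exactly one slot chosen uniformly at random, independently of earlier rounds. The function names are hypothetical.

```python
def prob_never_tested(n_vms: int, rounds: int) -> float:
    """Probability that one particular VM slot is never chosen.

    Each round misses a given slot with probability (1 - 1/n), and rounds
    are assumed independent, so the misses multiply.
    """
    return (1 - 1 / n_vms) ** rounds


def expected_rounds_to_cover_all(n_vms: int) -> float:
    """Expected rounds until every slot has been chosen at least once.

    This is the classic coupon collector result: n * (1 + 1/2 + ... + 1/n).
    """
    return n_vms * sum(1 / k for k in range(1, n_vms + 1))
```

Under this model, with 100 VM slots and 100 random terminations, any given slot still has roughly a 37% chance of never having been tested, which is why purely random selection gives no guarantee about the critical VMs.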

Testing against a library of “what if” failure scenarios becomes an essential step in application deployments to cloud.  
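One way to picture such a library is as a list of named failure injections that are each run against a fresh copy of the environment. The Python sketch below is purely illustrative: `Environment`, `Scenario`, the two example scenarios and the zone names are all hypothetical, and the pass criterion (enough healthy VMs remain immediately after the failure) is a deliberately simple stand-in for a real recovery check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Environment:
    """Hypothetical toy model: healthy VM count per availability zone."""
    healthy_vms: Dict[str, int]

    def total_healthy(self) -> int:
        return sum(self.healthy_vms.values())


@dataclass
class Scenario:
    """A named "what if" failure that mutates the environment."""
    name: str
    inject: Callable[[Environment], None]


def kill_one_vm(env: Environment) -> None:
    """Terminate one VM in the zone that currently has the most VMs."""
    zone = max(env.healthy_vms, key=env.healthy_vms.get)
    env.healthy_vms[zone] -= 1


def lose_zone(zone: str) -> Callable[[Environment], None]:
    """Build an injection that takes out every VM in the given zone."""
    def inject(env: Environment) -> None:
        env.healthy_vms[zone] = 0
    return inject


SCENARIOS: List[Scenario] = [
    Scenario("single VM terminated", kill_one_vm),
    Scenario("zone-a outage", lose_zone("zone-a")),
]


def run_library(make_env: Callable[[], Environment],
                scenarios: List[Scenario],
                min_healthy: int) -> Dict[str, bool]:
    """Run each scenario on a fresh environment and record pass or fail."""
    results = {}
    for scenario in scenarios:
        env = make_env()
        scenario.inject(env)
        results[scenario.name] = env.total_healthy() >= min_healthy
    return results
```

For example, a deployment with three VMs in zone-a and one in zone-b that needs at least two healthy VMs would pass the single-VM scenario but fail the zone outage, pointing directly at the architectural gap to fix.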

To ensure reliability, applications on cloud require a new way to test that perturbs the underlying environment.  The role of the monkey is to cause problems and to create chaos.

Only then can you verify the effectiveness of your automated recovery mechanisms. When testing for cloud, introducing failures becomes an essential step.  You need to fail often so you don’t fail when it counts.  

Posted by Teresa Tung, Manager, Accenture Technology Labs and Qing Xie, Researcher, Accenture Technology Labs

Tags: accenture, amazon web services, cloud computing, data centre, netflix, virtual machine, virtualisation
