How would your OSS handle a troop of chaos monkeys?

A recent post discussed running fire drills on your OSS to see how your organisation would handle outages.

In Monday’s blog, we discussed where OSS customers are going to be and what their requirements are going to look like. The “Telcos as OTT players” model will be hyperscaled and more “transient by nature – services, contracts, capacity, etc. will be spun up and torn down at will, requiring rapid flexibility to respond to market needs far more quickly than the current OSS.”

Netflix has a novel approach to combining these two concepts. They’ve developed open-source software that they call Chaos Monkey, which “is a software tool that was developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS). The software simulates failures of instances of services running within Auto Scaling Groups (ASG) by shutting down one or more of the virtual machines.” (WhatIs.com)

As described on the Netflix blog, Chaos Monkey is a highly configurable automation that resiliency-tests the applications, people and processes of their solution. In their words, “We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient…. Failures happen and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that ‘simple fix’ you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer ‘quick patched’ an instance last week and forgot to commit the changes to your source repository?”
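To make the idea concrete, here is a minimal sketch of the principle in Python. It is not Netflix's Chaos Monkey and it calls no AWS API. It only simulates, in memory, a group of service instances, a "monkey" that terminates some of them at random, and a toy load balancer that has to route around the casualties. Every name in it (Instance, ServiceGroup, unleash_monkey, route_request, "alarm-collector") is hypothetical and chosen purely for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """Hypothetical stand-in for a group of identical OSS service instances."""
    name: str
    instances: list = field(default_factory=list)

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]


def unleash_monkey(group, rng, max_kills=1):
    """Mark up to max_kills randomly chosen healthy instances as failed.

    Returns the ids of the instances that were taken down.
    """
    candidates = group.healthy_instances()
    victims = rng.sample(candidates, min(max_kills, len(candidates)))
    for victim in victims:
        victim.healthy = False
    return [v.instance_id for v in victims]


def route_request(group, rng):
    """Toy load balancer: pick any healthy instance, fail if none remain."""
    healthy = group.healthy_instances()
    if not healthy:
        raise RuntimeError(f"{group.name}: no healthy instances left")
    return rng.choice(healthy).instance_id


if __name__ == "__main__":
    rng = random.Random()  # assumption: pass a seed here for repeatable drills
    group = ServiceGroup("alarm-collector", [Instance(f"i-{n}") for n in range(3)])
    print("terminated:", unleash_monkey(group, rng, max_kills=1))
    print("request served by:", route_request(group, rng))
```

In a real drill the interesting part is what replaces route_request: your actual load balancers, rebuild scripts and on-call processes. The sketch only shows the shape of the experiment, which is to break something on purpose and then check that requests are still served.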

How well would your OSS hold up against an attack by a Chaos Monkey?

