How would your OSS handle a troop of chaos monkeys?

A recent post on this blog discussed the concept of running fire drills on your OSS to see how your organisation would handle outages.

In Monday’s blog, we discussed where OSS customers are going to be and what their requirements are going to look like. The “Telcos as OTT players” model will be hyperscaled and more “transient by nature – services, contracts, capacity, etc. will be spun up and torn down at will, requiring rapid flexibility to respond to market needs far more quickly than the current OSS.”

Netflix has a novel approach to combining these two concepts. They’ve developed open source software called Chaos Monkey, which, according to WhatIs.com, “is a software tool that was developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS). The software simulates failures of instances of services running within Auto Scaling Groups (ASG) by shutting down one or more of the virtual machines.”

As described on the Netflix blog, Chaos Monkey is a highly configurable automation that resiliency-tests the applications, people and processes of their solution. In their words, “We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient…. Failures happen and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that ‘simple fix’ you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer ‘quick patched’ an instance last week and forgot to commit the changes to your source repository?”
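To make the idea concrete, here is a minimal in-memory sketch of that kind of drill, written in Python. It is not Netflix's Chaos Monkey and does not call AWS or any real API. The names `Instance`, `ServiceGroup`, `unleash_monkey` and `run_drill` are all hypothetical, invented for illustration. It assumes a group of identical instances, a load balancer that routes only to healthy ones, and a self-healing step that restores capacity after each failure.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """Toy stand-in for a group of identical instances (hypothetical)."""
    instances: list
    desired: int
    replacements: int = 0

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def route(self, rng):
        """Load balancer: route to a healthy instance, or fail if none remain."""
        candidates = self.healthy_instances()
        if not candidates:
            raise RuntimeError("outage: no healthy instances left")
        return rng.choice(candidates)

    def heal(self):
        """Drop failed instances and add replacements up to desired capacity."""
        self.instances = self.healthy_instances()
        while len(self.instances) < self.desired:
            self.replacements += 1
            self.instances.append(Instance(f"replacement-{self.replacements}"))


def unleash_monkey(group, rng):
    """Shut down one randomly chosen healthy instance and return it."""
    victim = rng.choice(group.healthy_instances())
    victim.healthy = False
    return victim


def run_drill(size, rounds, seed=0):
    """Kill one instance per round, then check requests are still served."""
    rng = random.Random(seed)
    group = ServiceGroup(
        [Instance(f"oss-node-{n}") for n in range(size)], desired=size
    )
    served = 0
    for _ in range(rounds):
        victim = unleash_monkey(group, rng)
        target = group.route(rng)   # must route around the failed instance
        assert target is not victim
        served += 1
        group.heal()                # rebuild capacity before the next round
    return served, len(group.healthy_instances())
```

Calling `run_drill(3, 10)` survives every round because two healthy instances always remain to take traffic, whereas `run_drill(1, 1)` raises `RuntimeError`: with a single instance there is nothing to route around. A real drill would of course target live infrastructure, people and processes rather than a toy model.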

How well would your OSS hold up against an attack by a Chaos Monkey?

October 5, 2016
Ryan
