Auto-releasing chaos monkeys to harden your network (CT/IR)

In earlier posts, we’ve talked about using Netflix’s chaos monkey approach as a way of getting to Zero Touch Assurance (ZTA). The chaos monkeys intentionally trigger faults in the network as a means of ensuring resilience. Not just for known degradation / outage events, but to unknown events too.

I’d like to introduce the concept of CT/IR – Continual Test / Incremental Resilience. Analogous to CI/CD (Continuous Integration / Continuous Delivery) before it, CT/IR is a method to systematically and programmatically test the resilience of the network, then ensuring resilience is continually improving.

This is done by storing a knowledge base of failure cases, pre-emptively triggering them and then recording the results as seed data (for manual or AI / ML observations). Using traditional techniques, we look at event logs and try to reverse-engineer what the root-cause MIGHT be. In the case of CT/IR, the root-cause is certain. We KNOW the root-cause because we systematically and intentionally triggered it.

The continual, incremental improvement in resiliency potentially comes via multiple feedback loops:

Ideally, the existing resilience mechanisms work around or overcome any degradation or failure in the network
The continual triggering of faults into the network will provide additional seed data for AI/ML tools to learn from and improve upon, especially root-cause analysis
We can program the network to overcome the problem (eg turn up extra capacity, re-engineer traffic flows, change configurations, etc). Having the NaaS that we spoke about yesterday, provides greater programmability for the network by the way.
We can implement systematic programs / projects to fix endemic faults or weak spots in the network *
Perform regression tests to constantly stress-test the network as it evolves through network augmentation, new device types, etc

Now, you may argue that no carrier in their right mind will allow intentional faults to be triggered. So that’s where we unleash the chaos monkeys on our digital twin technology and/or PSUP (Production Support) environments at first. Then on our prod network if we develop enough trust in it.

I live in Australia, which suffers from severe bushfires every summer. Our fire-fighters spend a lot of time back-burning during the cooler months to reduce flammable material and therefore the severity of summer fires. Occasionally the back-burns get out of control, causing problems. But they’re still done for the greater good. The same principle could apply to unleashing chaos monkeys on a production network… once you’re confident in your ability to control the problems that might follow.

* When I say network, I’m also referring to the physical and logical network, but also support functions such as EMS (Element Management Systems), NCM (Network Configuration Management tools), backup/restore mechanisms, service order replay processes in the event of an outage, OSS/BSS, NaaS, etc.

May 23, 2019
Ryan

If you found this article useful or valuable, subscribe (in the top-right corner of this page) and share. Let's spread the word and inspire more people to become passionate about OSS. Ryan is Passionate About OSS and has dedicated the last two decades to sharing his passion for OSS with the world. He is a founder, author, blogger, Engineer, connector and inquisitive learner about OSS and managing networks. To find out a little about his back-story and why he's so Passionate About OSS, click on the About Page. To connect with Ryan and the PAOSS team, click on the Contact page.

All Posts