In earlier posts, we’ve talked about using Netflix’s chaos monkey approach as a way of getting to Zero Touch Assurance (ZTA). The chaos monkeys intentionally trigger faults in the network as a means of ensuring resilience, not just against known degradation / outage events, but against unknown events too.
I’d like to introduce the concept of CT/IR (Continual Test / Incremental Resilience). Analogous to CI/CD (Continuous Integration / Continuous Delivery) before it, CT/IR is a method to systematically and programmatically test the resilience of the network, then ensure that resilience is continually improving.
This is done by storing a knowledge base of failure cases, pre-emptively triggering them and then recording the results as seed data (for manual or AI / ML observations). Using traditional techniques, we look at event logs and try to reverse-engineer what the root-cause MIGHT be. In the case of CT/IR, the root-cause is certain. We KNOW the root-cause because we systematically and intentionally triggered it.
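To make the idea concrete, here is a minimal sketch of a failure-case knowledge base that produces labelled seed data. Everything in it is an assumption for illustration only: the `FailureCase` schema, the `run_case` helper, the case ID, and the simulated events are hypothetical, and a real injection would act on a digital twin or test environment rather than return canned strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FailureCase:
    """One entry in the knowledge base of failure cases (hypothetical schema)."""
    case_id: str
    root_cause: str                  # known in advance, because we inject it
    inject: Callable[[], List[str]]  # triggers the fault, returns observed events


def run_case(case: FailureCase) -> Dict[str, object]:
    """Trigger one fault and return a record labelled with its true root cause."""
    events = case.inject()
    return {"case_id": case.case_id, "root_cause": case.root_cause, "events": events}


def simulated_link_down() -> List[str]:
    # Stand-in for a real fault injection; these event strings are invented.
    return ["LOS on port 1/1/1", "OSPF neighbour down", "service degraded"]


knowledge_base = [FailureCase("FC-001", "fibre cut on link A-B", simulated_link_down)]

# Each record pairs the observed events with a root cause we are certain of.
seed_data = [run_case(case) for case in knowledge_base]
```

The point of the design is the `root_cause` field: because the fault was injected deliberately, each record is labelled with a certain cause rather than one inferred after the fact from event logs.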
The continual, incremental improvement in resiliency potentially comes via multiple feedback loops:
- Ideally, the existing resilience mechanisms work around or overcome any degradation or failure in the network
- The continual injection of faults into the network will provide additional seed data for AI/ML tools to learn from and improve upon, especially for root-cause analysis
- We can program the network to overcome the problem (e.g. turn up extra capacity, re-engineer traffic flows, change configurations, etc). Incidentally, the NaaS that we spoke about yesterday provides greater programmability for the network.
- We can implement systematic programs / projects to fix endemic faults or weak spots in the network *
- Perform regression tests to constantly stress-test the network as it evolves through network augmentation, new device types, etc
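The regression-test loop in the last bullet could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the case IDs, and the idea of scoring resilience as a simple recovered fraction are all hypothetical, and the lambdas stand in for real fault injections that report whether the network recovered.

```python
from typing import Callable, Dict, List


def regression_run(cases: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Trigger every known failure case; True means the network recovered."""
    return {case_id: recovered() for case_id, recovered in cases.items()}


def regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Cases that recovered in the previous run but not in the current one."""
    return sorted(c for c, ok in current.items() if not ok and previous.get(c, False))


def resilience_score(results: Dict[str, bool]) -> float:
    """Fraction of injected faults the network recovered from."""
    return sum(results.values()) / len(results) if results else 0.0


# Results from an earlier run, before a (hypothetical) network augmentation.
previous = {"FC-001": True, "FC-002": True, "FC-003": False}

# Stand-ins for real injections after the augmentation.
current = regression_run({
    "FC-001": lambda: True,
    "FC-002": lambda: False,
    "FC-003": lambda: True,
})

newly_broken = regressions(previous, current)  # cases to investigate first
score = resilience_score(current)
```

Comparing each run against the previous one is what makes the resilience incremental: a case that used to recover and no longer does points to something the latest change broke.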
Now, you may argue that no carrier in their right mind will allow intentional faults to be triggered. That’s why we unleash the chaos monkeys on our digital twin technology and/or PSUP (Production Support) environments at first, then on our prod network only once we’ve developed enough trust in the approach.
I live in Australia, which suffers from severe bushfires every summer. Our fire-fighters spend a lot of time back-burning during the cooler months to reduce flammable material and therefore the severity of summer fires. Occasionally the back-burns get out of control, causing problems. But they’re still done for the greater good. The same principle could apply to unleashing chaos monkeys on a production network… once you’re confident in your ability to control the problems that might follow.
* When I say network, I’m referring not only to the physical and logical network, but also to support functions such as EMS (Element Management Systems), NCM (Network Configuration Management tools), backup/restore mechanisms, service order replay processes in the event of an outage, OSS/BSS, NaaS, etc.