There’s a common misconception that AIOps and Network Automation solutions are simply stood up and a bunch of people are stood down. That is, as soon as the tools become operational, the network operator can start reducing head-count. Project sponsors pray for it (to justify their business cases); network operations staff fear it (for the impact it could have on their careers).
However, the reality is that the tools of today (and of foreseeable tomorrows) deliver more of a progressive benefit realisation, where it takes time for the machines to handle more and more event situations by themselves. We like to present this in the following (stylised) asymptote diagram:
At T0, all events are handled manually (this assumption usually isn’t quite true because AIOps / automation tools are generally replacing the rules-based automations of today, but bear with us). Then as more and more patterns are identified, codified and automated each month (T1, T2, onwards), the machines handle more of the load.
But as an asymptote implies, we probably never actually reach a state where machines are handling 100% of incidents.
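To make the shape of that curve a little more concrete, here’s a minimal sketch in Python. It assumes, purely for illustration, that the machines’ share of events follows an exponential-saturation curve, coverage(t) = 1 − e^(−kt); the rate constant and the timeframes are invented for the example rather than drawn from any real deployment.

```python
import math

# Illustrative only: assume the machines' share of events follows an
# exponential-saturation curve, coverage(t) = 1 - exp(-k * t).
# K is a made-up "patterns codified per month" rate, not real data.
K = 0.05

def automated_share(months: int, k: float = K) -> float:
    """Fraction of events handled by machines after a given number of months."""
    return 1.0 - math.exp(-k * months)

for months in (0, 6, 12, 24, 60, 120):
    machines = automated_share(months)
    print(f"T+{months:>3} months: machines {machines:6.1%}, humans {1 - machines:6.1%}")
```

In this toy model the machines’ share climbs quickly at first and then flattens, but even at T+120 months it never quite reaches 100%, which is exactly the behaviour the asymptote diagram is trying to convey.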
As we discussed previously, this leaves us with a few dilemmas to wrestle with, including the potential for a skills gap to widen between network masters and apprentices.
In advanced network operations organisations today, resiliency models are strong and most Sev1 outages are akin to Black Swan events – situations or combinations of events that haven’t been seen before.
The network operators rely on network masters, who we refer to as highly trained ninjas. These people generally have decades of experience, having done their “apprenticeships” back in the days before automations existed. This means they have seen it all before. Even if they haven’t seen a specific black swan pattern, they generally see similarities or symptoms that point them to the root cause of the problem.
All of this leaves us with the following 12 significant dilemmas that we will need to overcome to reach our objectives of zero-touch operations of complex networks.
Dilemma #1
As mentioned, many current ninjas have been around for decades, allowing them to know how complex T1 networks hang together and spot the tell-tale signs of root causes. But having been around for decades means that many of today’s ninjas are moving ever closer to retirement.
Dilemma #2
In recent years, a lot of network apprentice roles have been outsourced. This means there are fewer apprentices on the path to ninja status.
Dilemma #3
Not only have apprentice roles been outsourced, but as we increasingly hand responsibilities for fault-finding / fixing to machines, we potentially have no more apprentices on the path to ninja status at all. We’re leaving all responsibility in the hands of the machines.
Dilemma #4
The automation curve never reaches the asymptote in the foreseeable future. That means there will always be some patterns that the machines can’t handle. Do you think these will be the simple, common patterns? Nope! They’ll be the most obscure, multi-domain, highly complex patterns.
Dilemma #5
The assurance tools we’ve built for the last 2-3 decades have all been designed to handle event volumes at scale. As we approach the asymptote, volume is no longer the problem our assurance tools need to help solve. The next generation of assurance tools must help to solve only the most obscure, multi-domain, highly complex event sequences. That’s a totally different UI / UX!
Dilemma #6
As networks get more virtualised and more complex (under the covers at least, but more abstracted and simple for human interactions), network operators are likely to have a greater dependence on their vendors / suppliers for challenging event sequences. Dilemma #2 exacerbates this.
Dilemma #7
The assurance tools we’ve built for the last 2-3 decades have all been designed to handle routine person-to-person escalation paths – a mentality of triage, tick and flick. In scenarios where most Sev1s are multi-domain and/or previously unseen combinations of events, we increase the likelihood that close collaboration will be needed between domain ninjas (internal resources) and vendor / supplier experts (external resources). Again, that’s a totally different UI / UX required of our assurance tools!
Dilemma #8
If the ninjas of the future no longer have a comprehensive apprenticeship (dilemma #2) and are dealing only with outages that have never been seen before (dilemma #4), then we arguably no longer have a ninja team that can diagnose root cause from memories of past patterns. This means we’re probably now looking for ninjas with an entirely different set of skills. Network domain knowledge is still an important attribute, but arguably it becomes more about a unique psychological make-up. That is, a rare set of people who are able to operate in the unknown, keep their wits in chaotic, high-pressure situations, and provide leadership and precise decision-making across teams of many skilled collaborators.
Dilemma #9
In these high-stress, multi-domain, complex situations (dilemma #8), it’s quite possible that automations and resiliency mechanisms are even adding to the challenges of re-instating an impacted network. Few of these mechanisms have “kill-switches.” Even if they did, which kill-switches would you pull and which would you leave running? (In major outages where the blast-zones are large, we often rely on automations and/or scripts and/or roll-backs to a previous state to propagate high-volume changes that we hope will re-start the network.)
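For what it’s worth, a kill-switch doesn’t need to be elaborate. Below is a minimal, hypothetical sketch (the flag name, event fields and domains are all invented for illustration, not any vendor’s actual API) of an automation loop that checks a centrally managed switch before acting, so operators can stand automation down per domain during a major incident.

```python
import os

# Hypothetical kill-switch: a single environment variable (or a flag in a
# config service) listing the domains where automation must stand down,
# e.g. AUTOMATION_KILL_SWITCH="transport,ran" or "all".
KILL_SWITCH_ENV = "AUTOMATION_KILL_SWITCH"

def automation_enabled(domain: str) -> bool:
    """Return False if the kill-switch covers this domain (or all domains)."""
    raw = os.environ.get(KILL_SWITCH_ENV, "")
    disabled = {d.strip().lower() for d in raw.split(",") if d.strip()}
    return "all" not in disabled and domain.lower() not in disabled

def remediate(event: dict) -> None:
    """Auto-remediate only when the kill-switch allows it; otherwise escalate."""
    domain = event.get("domain", "unknown")
    if not automation_enabled(domain):
        print(f"[{domain}] kill-switch active - escalating {event['id']} to a human")
        return
    print(f"[{domain}] auto-remediating {event['id']}")

remediate({"id": "INC-0042", "domain": "transport"})
```

The harder question from this dilemma still stands, though: knowing which switches to pull, and in what order, during a large blast-radius outage.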
Dilemma #10
There are more and more automations appearing in different parts of our network and systems stacks. We have closed-loop systems within domains (eg embedded within EMS / NMS). We have closed-loop systems across domains (eg SON). We have custom self-healing / self-provisioning / self-optimising / self-protecting mechanisms. Then we have over-arching systems like AIOps and Autonomous Networking and RPA tools that are also trying to adapt to changing situations. To use an analogy, think of this like your house having the air-conditioner’s thermostat set to 23°C and the central heating’s thermostat set to 25°C. They’re constantly at war trying to find their ideal state. But due to the number of well-intended closed-loop systems, we’re not just analogising to temperature here, but humidity, barometric pressure, etc, etc. And due to abstractions (as mentioned in dilemma #6), we might not even be aware that some systems have thermostats (closed-loop mechanisms) at all. To extend dilemma #9, how many of these have kill-switches?
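To show how quickly well-intended loops can work against each other, here’s a toy simulation of the thermostat analogy in Python (the setpoints and step sizes are invented, and neither “controller” knows the other exists):

```python
def aircon(temp: float) -> float:
    """Cooling loop: nudge the room toward its 23°C setpoint."""
    return temp - 1.0 if temp > 23.0 else temp

def heater(temp: float) -> float:
    """Heating loop: nudge the room toward its 25°C setpoint."""
    return temp + 1.0 if temp < 25.0 else temp

temp = 24.0
for tick in range(8):
    # Each loop runs on its own cycle, oblivious to the other's setpoint.
    controller = aircon if tick % 2 == 0 else heater
    temp = controller(temp)
    print(f"tick {tick}: {controller.__name__:6s} -> {temp:.1f}°C")

# The temperature bounces between the two setpoints indefinitely; the system
# never settles because each loop keeps undoing the other's work.
```

Now imagine the same pattern across dozens of abstracted closed loops, some of which you may not even know exist.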
Dilemma #11
In my experience with carriers at least, they tend to prioritise the ITIL processes of Incident Management, Problem Management and Event Management (the reactive situations). Less effort and fewer resources tend to go into Service Validation & Testing, especially of the combinational scenarios and change events that most Sev1s comprise (even though this is the more proactive / preventative approach).
Dilemma #12
The people responsible for designing and building the automations are often the same people whose roles will be cut once automations are stood up. [Note that this is also true in situations where SIs have won NOC outsourcing contracts as well as AIOps / automation systems contracts]
What do we do next?
These dilemmas mean we have to totally re-think the way we handle network assurance:
- Tools – designing UI / UX for complexity and collaboration rather than volume and escalation. We also need to ensure that automated assurance tools are watched to see that they’re righting the ship, not steering it towards the sirens.
- Training – planning ninja pathways and testing / training for performance in chaotic / stressful situations
- Processes and Ops Model – we need to carefully plan a progressive change in processes and Ops Models as we incrementally move from a near 100% manual to a near 100% automated operational environment. This happens over a period of months and years.
- Culture – there are many human factors involved in this transition to near-total automation
- Testing – in relation to dilemma #11, we have the means to shift left in the ITIL flow (ie re-allocate resources to be proactive rather than reactive). With automated testing, logging and load-balancing approaches, we now have the ability to invest far more effort in testing before / during / after a change to avoid outages, or to roll forward during changes (see the sketch after this list). We also have the ability to do more combinational / cross-domain testing to find black swans before they happen. [In fairness, carriers also tend to do quite a good job of Availability / Capacity / Continuity Management (design / architecture resilience mechanisms).]
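As one small example of what that shift left could look like in practice, here’s a minimal, hypothetical sketch of a pre / post-change health comparison. The metric names, values, threshold and snapshot function are all stand-ins for whatever your monitoring / assurance stack actually exposes.

```python
# Hypothetical shift-left check: snapshot key health metrics before a change,
# re-snapshot afterwards, and flag anything that moved by more than a tolerance.
TOLERANCE = 0.05  # 5% relative movement; an arbitrary threshold for this sketch

def collect_health_snapshot() -> dict:
    """Stand-in for querying the real monitoring stack (metric -> value)."""
    return {"bgp_sessions_up": 482, "packet_loss_pct": 0.02, "latency_ms": 11.4}

def regressions(before: dict, after: dict, tolerance: float = TOLERANCE) -> list:
    """Return the metrics whose relative movement exceeds the tolerance."""
    moved = []
    for metric, old in before.items():
        new = after.get(metric, 0.0)
        if old and abs(new - old) / abs(old) > tolerance:
            moved.append((metric, old, new))
    return moved

before = collect_health_snapshot()
# ... the change itself would be applied here ...
after = collect_health_snapshot()
for metric, old, new in regressions(before, after):
    print(f"REGRESSION: {metric} moved from {old} to {new} - consider rolling back")
```

In a real pipeline the two snapshots would come from live telemetry, and the comparison could gate whether the change proceeds, rolls forward or rolls back.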
As always, it’s important that I ask: what have I left out? I’m sure there are many other dilemmas that I’ve overlooked. I’m sure there are even more next steps that need to be considered. I’d love it if you were to leave your thoughts in the comments section below to expand on what’s written above!
2 Responses
Nice article. Insightful as always. The level of Tribal Knowledge that gets enmeshed with the ways of working for some Network Operators is astonishing. AIOps helps us zero in on those, unpack them layer by layer, and I guess eventually head towards that point where we can plan zero-touch operations. What I have generally observed is that zero touch is easier in Greenfield sorts of operations where the vendor has good documentation around methods of procedure, and it takes a while to crack it in legacy networks because of the lack of properly documented procedures and the lack of standards.
There should be an AIOps standard in the future, where a vendor needs to be certified for its AIOps-iness, ie standardised APIs, procedures etc all aligned in some TMF format. Or perhaps I am dreaming too much on a Friday morning!!
Thanks Hussain!
I feel like it’s the combinations of events or cascading events that are likely to challenge us.
As you suggest, anything we can do to standardise and/or reduce complexity will help not just for AIOps / automation but for many other aspects of telco operations… as described in this old article about the Pyramid of Pain! https://passionateaboutoss.com/the-pyramid-of-oss-pain/
How do we make the network simpler? How do we make the product offerings (and product variants) simpler? That’s a starting point for reducing the number of variants, which means fewer combinations of things that can potentially go wrong… thus making zero-touch more realisable.
Enjoy your Friday and weekend Hussain!