In the old-school world of network assurance, we just polled our network devices and aggregated all the events into an event list. But then our networks got bigger and too many events were landing in the list for our assurance teams to process.
The next fix was to apply filters. For example, that meant dropping the Info and Warning messages because they weren’t all that important anyway… were they?
But still, the event list just kept scrolling off the bottom of the page. Ouch. So then we looked to apply correlation and suppression rules. That is, we applied a set of correlations so that related alarms could be bundled together into a single “parent” event, allowing the “child” events to be suppressed.
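To make the correlation and suppression idea concrete, here is a minimal sketch in Python. It assumes a simple rule: if a device's upstream neighbour has raised a "device down" alarm, the device's own events are attached to that alarm as children and hidden from the list. The `Event` class, the `UPSTREAM` topology map, the device names and the `correlate` function are all hypothetical, invented for illustration rather than taken from any particular assurance product.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    device: str
    severity: str
    message: str
    # Suppressed "child" events bundled under this event.
    children: list = field(default_factory=list)


# Hypothetical topology: each device maps to the device upstream of it.
UPSTREAM = {"access-sw-1": "agg-rtr-1", "access-sw-2": "agg-rtr-1"}


def correlate(events, upstream):
    """Return only the events an operator should see.

    Any event from a device whose upstream device is reporting
    "device down" is attached to that upstream alarm as a child
    and left out of the returned list.
    """
    down = {e.device: e for e in events if e.message == "device down"}
    visible = []
    for e in events:
        parent = down.get(upstream.get(e.device))
        if parent is not None:
            parent.children.append(e)  # suppressed, but not discarded
        else:
            visible.append(e)
    return visible


events = [
    Event("agg-rtr-1", "critical", "device down"),
    Event("access-sw-1", "major", "device unreachable"),
    Event("access-sw-2", "major", "device unreachable"),
    Event("core-rtr-9", "warning", "high CPU"),
]

for e in correlate(events, UPSTREAM):
    print(e.device, e.message, f"({len(e.children)} suppressed)")
```

Note that the child events are kept on the parent rather than thrown away, so the detail is still there for a machine (or a curious operator) to use later.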
Then we got a bit more advanced with our rules and performed root-cause analysis (RCA). Now, we’re moving to identify patterns using learning algorithms… to reduce the volume of the event list. But with virtualised networks, higher-speed telemetry and increased network complexity, the list keeps growing and the rules have to get more “dynamic.”
Each of these approaches designs a more highly filtered lens through which a human operator can view the health of the network. The filters and rules are effectively dumbing down the information that lands with the operator. The objective appears to be to develop a suitably dumbed-down solution that allows us to throw lots of minimally trained (and cheaper) human operators at the (still) high transaction count problem. That means the GUI is designed to filter out and dumb down too.
But here’s the thing. The alarm list harks back decades to when telcos were happy having a team of Engineers running the NOC, resolving lots of transactions. Fast forward to today and the telcos aspire to zero-touch assurance. That implies a solution that’s designed with machines in mind rather than humans. What we really want is for the network to self-heal based on all the telemetry it’s seeing.
Unfortunately, rare events can still happen. We still need human operators in the captain’s seat ready to respond when self-healing mechanisms are no longer able to self-heal.
So instead of dumbing down and filtering out for a large number of minimally trained operators, perhaps we could consider doing the opposite.
Let’s continue to build networks and automations that take responsibility for the details of every transaction (even warning / info events). But let’s instead design a GUI that is used by a small number of highly trained operators, allowing them to see the overall network health posture and respond with dynamic tools and interactions. Preferably, they’d respond before the event, using predictive techniques (that might just learn from all those warning / info events that we would’ve otherwise discarded).
Hat tip to Jay for some of the contrarian thoughts that seeded this post.