Artificial Intelligence is purported to be the panacea for so many high-volume factors within telco. For example, AI is being used for complex pattern recognition across diverse signal streams, for use-cases like churn reduction. AIOps solutions are making great strides in reducing noise within alarm / event lists, filtering and automating to limit the number of decisions that need to be made by humans. We’re also seeing AIOps-equivalent solutions starting to make progress on other high-volume streams like contact centre call handling and field work.
With this automated volume reduction occurring, we probably need to re-evaluate how our OSS user interfaces work. I see this playing out across two dimensions:
- Pareto’s 20 – Whereas our UIs have historically been designed for handling transactions at volume, we can now start thinking instead about what’s slipping through the cracks. The ~80% is being automated, filtered, prioritised, etc. The remaining ~20% (the 20% that’s really hard to characterise, triage or identify) is now where humans need to intervene to take it ever closer to 0% un-matched (a simple routing sketch follows this list).
- Watching the watcher – We can’t just let the automations run without having some way of ensuring they’re doing the right things. We need a way of observing the automations. Even if they’re designed to optimise for certain outcomes (eg truck-roll reduction), the environments they’re working within are so complex that it’s quite feasible for secondary or tertiary effects (eg infrastructure costs) to become sub-optimal.
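To make the Pareto split a little more concrete, here’s a minimal sketch (in Python) of the routing idea: events that the automations can confidently match are handled without human involvement, while the hard-to-characterise residue surfaces in the human workbench. The confidence threshold and event shape below are my assumptions, purely for illustration:

```python
# A minimal sketch of routing the "easy ~80%" to automation and surfacing
# the hard ~20% to humans. The threshold and event shape are assumptions.
AUTO_CONFIDENCE_THRESHOLD = 0.9  # tuned in practice, assumed here

def route_event(event: dict) -> str:
    """Send confidently-matched events to automation; the rest to humans."""
    if event["match_confidence"] >= AUTO_CONFIDENCE_THRESHOLD:
        return "automation"        # the ~80%: filtered / auto-remediated
    return "human_workbench"       # the ~20%: needs expert triage

for event in [{"id": 1, "match_confidence": 0.97},
              {"id": 2, "match_confidence": 0.41}]:
    print(event["id"], "->", route_event(event))
# 1 -> automation
# 2 -> human_workbench
```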
For a UI to handle Pareto’s 20, it seems we need a workbench that accommodates three key factors:
- Blast Radii – The ability to observe the blast radius of any given event across all four proximities (time, topology, geography, object hierarchy), since a root-cause event often triggers ripple-out alarms / events that can obfuscate the true origin
- Dependency Visualisation and Interactions – The ability to present data in ways that unravel the mesh of dependencies and unlock the real causes from the many possibilities. This could be like the approach used by detectives: collect / connect the evidence and rule it in / out (as a slightly related aside, check out this interesting video about using graph theory to solve crimes, which might also apply to our root-cause identification concept, though I haven’t dived deep enough into it to judge its viability – a rough sketch of the graph idea follows this list)
- Collaboration – The ability to collaborate with experts across any domain
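On the detective-style dependency angle, here’s a minimal sketch of how graph theory might rank root-cause candidates, assuming we already have a resource dependency graph and a set of currently-alarmed objects. The toy topology and scoring rule are illustrative assumptions, not a production algorithm:

```python
# A minimal sketch of graph-based root-cause ranking. An edge A -> B means
# "B depends on A", so a fault at A can ripple out to everything downstream.
import networkx as nx

def rank_root_causes(dependencies: nx.DiGraph, alarmed: set[str]) -> list[tuple[str, int]]:
    """Score each alarmed node by how many other alarms it could explain."""
    scores = {}
    for node in alarmed:
        downstream = nx.descendants(dependencies, node)
        # A strong root-cause candidate explains many of the other alarms.
        scores[node] = len(downstream & alarmed)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy topology: a core router feeding two aggregation switches and services.
g = nx.DiGraph()
g.add_edges_from([
    ("core-router-1", "agg-switch-1"), ("core-router-1", "agg-switch-2"),
    ("agg-switch-1", "cust-service-A"), ("agg-switch-2", "cust-service-B"),
])
alarms = {"core-router-1", "agg-switch-1", "cust-service-A", "cust-service-B"}
print(rank_root_causes(g, alarms))
# core-router-1 ranks first: its failure explains the other three alarms.
```

The detective analogy maps neatly: each alarm is a piece of evidence, the dependency graph connects it, and the ranking rules candidates in or out.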
Let’s look at blast radius first. AIOps tools are great at identifying patterns amongst data that is just too voluminous for people to handle. Networks can produce tens of thousands of performance metrics at streaming speeds. Multiply that by the increasing number of network devices, combine it with multi-domain dependencies (eg customer services, network domain interconnection, etc) and we have far too much data for humans to process every second (or whatever cadence your metrics tick over at). The same is true if you’re trying to triage an event that hasn’t been picked up by the automations.
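To make the four-proximity idea tangible, here’s a rough sketch of how a pairwise blast-radius score might collapse time, topology, geography and object hierarchy into a single number for clustering ripple-out events. The weights and decay constants are assumptions for illustration only:

```python
# A rough sketch of collapsing the four proximities (time, topology,
# geography, object hierarchy) into one blast-radius score for a pair of
# events. The weights and decay constants are illustrative assumptions.
import math

WEIGHTS = {"time": 0.4, "topology": 0.3, "geography": 0.2, "hierarchy": 0.1}

def blast_radius_score(dt_seconds: float, topo_hops: int,
                       geo_km: float, hierarchy_levels: int) -> float:
    """Return 0..1; higher means more likely part of the same incident."""
    time_p = math.exp(-abs(dt_seconds) / 60.0)   # ~1-minute time decay
    topo_p = 1.0 / (1 + topo_hops)               # same device scores 1.0
    geo_p = math.exp(-geo_km / 50.0)             # ~50 km geographic decay
    hier_p = 1.0 / (1 + hierarchy_levels)        # same object scores 1.0
    return (WEIGHTS["time"] * time_p + WEIGHTS["topology"] * topo_p +
            WEIGHTS["geography"] * geo_p + WEIGHTS["hierarchy"] * hier_p)

# Two events 30s apart, one hop away, in the same exchange building:
print(round(blast_radius_score(30, 1, 0.0, 1), 2))    # ~0.64 -> same cluster
# Two events an hour apart on unrelated gear across the country:
print(round(blast_radius_score(3600, 8, 900, 4), 2))  # ~0.05 -> unrelated
```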
I believe our next-generation UIs can actually use AI to identify “probable / possible cause” clusters of metrics and connections, and present them in a workbench across space and time. Each stream (eg telemetry, alarms, flows, connections, etc) is presented as a layer on a map view, and the operator is able to turn layers on/off to visualise possible linkages / impacts. More importantly, the user will have the ability to scrub backwards and forwards in time to see the sequence and proximities of events. (Note: Naturally, we’ll need the network to propagate these data streams rapidly, because true sequencing can be lost if there are significant time delays).
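A minimal sketch of the data structure behind such a workbench might look like the following: one time-ordered stream per layer, with a query that returns whatever the map should render at the scrubbed-to moment. The layer names and window size are assumptions; a real implementation would sit on a time-series store rather than in-memory lists (this sketch also uses the `key=` argument to `bisect`, which needs Python 3.10+):

```python
# A minimal sketch of a layered, time-scrubbable event store. Layer names
# and the window size are assumptions, purely for illustration.
import bisect
from collections import defaultdict

class ScrubbableWorkbench:
    def __init__(self):
        # One time-ordered list of (timestamp, payload) per layer.
        self._layers = defaultdict(list)

    def ingest(self, layer: str, timestamp: float, payload: dict):
        bisect.insort(self._layers[layer], (timestamp, payload),
                      key=lambda e: e[0])

    def view_at(self, t: float, window: float, active_layers: set[str]) -> dict:
        """Everything the map should render when scrubbed to time t."""
        view = {}
        for layer in active_layers:
            events = self._layers.get(layer, [])
            lo = bisect.bisect_left(events, t - window, key=lambda e: e[0])
            hi = bisect.bisect_right(events, t, key=lambda e: e[0])
            view[layer] = [payload for _, payload in events[lo:hi]]
        return view

wb = ScrubbableWorkbench()
wb.ingest("telemetry", 98.5, {"metric": "cpu", "node": "core-router-1", "value": 97})
wb.ingest("alarms", 100.0, {"id": "LOS", "node": "agg-switch-1"})
# Scrub to t=101 with a 5-second trailing window, alarms layer only:
print(wb.view_at(101.0, 5.0, {"alarms"}))
```

Turning a layer on/off is then just adding or removing it from `active_layers`, and scrubbing is just re-querying `view_at` with a different `t`.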
We also need new ways to collaborate on complex triage – taking a more modern “chat” approach rather than the ticket-and-flick-it used historically. And when I say chat, this could mean chatting with other humans, or it could also pull in machine inputs like line test results, social indicators, etc from something akin to a sophisticated chat-bot.
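As a minimal sketch of that idea, the triage “channel” below treats machine inputs as first-class participants alongside humans, so the conversation and the evidence stay interleaved in one timeline. The post shapes and bot names are purely illustrative:

```python
# A minimal sketch of a triage chat channel mixing human messages and
# machine-generated evidence (eg line test results) in one timeline.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Post:
    author: str   # human handle or bot identity
    kind: str     # "human" | "machine"
    body: dict    # free text for humans, structured results for bots
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class TriageChannel:
    incident_id: str
    posts: list[Post] = field(default_factory=list)

    def say(self, author: str, text: str):
        self.posts.append(Post(author, "human", {"text": text}))

    def attach(self, bot: str, result: dict):
        # Machine evidence lands in the same timeline as the conversation.
        self.posts.append(Post(bot, "machine", result))

channel = TriageChannel("INC-0042")
channel.say("noc-operator", "Seeing LOS alarms fanning out from agg-switch-1")
channel.attach("line-test-bot", {"circuit": "XYZ-123", "status": "fail", "db_loss": 9.8})
```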
Now, coming back to watching the watcher, we always have to install guard-rails or threshold-boundaries on the automation streams to ensure we don’t get runaway feedback loops. The automations could have the ability to game the system, getting great results against primary objectives while triggering unwanted side-effects. To avoid this, I believe we need more than just guard-rails: we need highly visual presentations of data to see what’s happening inside the “black-box” of the automation. Similar to the blast-radius example above, I can envisage this being presented as heat-maps across space and time… amongst other visualisation techniques.
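Here’s a minimal sketch of the guard-rail half of that idea: the automation optimises its primary objective, while an independent watcher checks secondary and tertiary metrics against safe bands and halts the automation on a breach. The metric names, bands and halt action are all assumptions:

```python
# A minimal "watching the watcher" guard-rail sketch: independently watch
# secondary metrics for drift while the automation chases its primary goal.
GUARD_RAILS = {
    # metric name: (lower bound, upper bound) considered safe
    "truck_rolls_per_day": (0, 500),              # primary objective
    "infrastructure_cost_per_day": (0, 120_000),  # secondary effect to watch
    "mean_time_to_restore_hours": (0, 6),         # tertiary effect to watch
}

def check_guard_rails(observed: dict) -> list[str]:
    """Return the list of breached guard-rails (empty means all clear)."""
    breaches = []
    for metric, (low, high) in GUARD_RAILS.items():
        value = observed.get(metric)
        if value is not None and not (low <= value <= high):
            breaches.append(f"{metric}={value} outside [{low}, {high}]")
    return breaches

# The automation is nailing its primary objective (few truck rolls) but
# quietly inflating infrastructure cost - exactly the gaming risk above:
today = {"truck_rolls_per_day": 42, "infrastructure_cost_per_day": 155_000}
breaches = check_guard_rails(today)
if breaches:
    print("Pause automation and alert a human:", breaches)
```

The visual layer would then render these same metrics as heat-maps across space and time, so a human can spot drift long before a hard threshold trips.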
Check out this video from Savvi, a tool that’s already available today, which provides a few hints about what we could make this look like:
Can you envisage what our next-gen OSS UIs will look like and why? Do you agree or disagree with any of the perspectives above? Do you have theories about how we can do this better? I’d love to hear your thoughts via the comment box below.
Oh and PS, I’d love to see our user-interfaces look far more advanced than they typically are today – to match the level of sophistication that’s going on behind the scenes in our amazing OSS solutions. Check out the user interfaces from Jorge Almeida in his stunning showreel below. Many are presented as Augmented Reality views, as our OSS surely will be before long, once smart-glass hardware catches up with the software. As indicated recently, it’s not just the presentation of visual data that matters, but the decision support that guides workers on what to do next!