….I want to make my network so observable, reliable, predictable and repeatable that I don’t need anyone to operate it.
That’s clearly a highly ambitious goal. Probably even unachievable if we say it doesn’t need anyone to run it. But I wonder whether this has to be the starting point we take on behalf of our network operator customers?
If we look at most networks, OSS, BSS, NOC, SOC, etc (I’ll call this whole stack “the black box” in this article), they’ve been designed from the ground up to be human-driven. We’re now looking at ways to automate as many steps of operations as possible.
If we were to instead design the black-box to be machine-driven, how different would it look?
In fact, before we do that, perhaps we have to take two unique perspectives on this question:
- Retro-fitting existing black-boxes to increase their autonomy
- Designing brand new autonomous black-boxes
I suspect our approaches / architectures will be vastly different.
The first will require a incredibly complex measure, command and control engine to sit over top of the existing black box. It will probably also need to reach into many of the components that make up the black box and exert control over them. This approach has many similarities with what we already do in the OSS world. The only exception would be that we’d need to be a lot more “closed-loop” in our thinking. I should also re-iterate that this is incredibly complex because it inherits an existing “decision tree” of enormous complexity and adds further convolution.
The second approach holds a great deal more promise. However, it will require a vastly different approach on many levels:
- We have to take a chainsaw to the decision tree inside the black box. For example:
- We start by removing as much variability from the network as possible. Think of this like other utilities such as water or power. Our electricity service only has one feed-type for almost all residential and business customers. Yet it still allows us great flexibility in what we plug into it. What if a network operator were to simply offer a “broadband dial-tone” service and end users decide what they overlay on that bit-stream
- This reduces the “protocol stack” in the network (think of this in terms of the long list of features / tick-boxes on any router’s brochure)
- As well as reducing network complexity, it drastically reduces the variables an end-user needs to decide from. The operator no longer needs 50 grandfathered, legacy products
- This also reduces the decision tree in BSS-related functionality like billing, rating, charging, clearing-house
- We achieve a (globally?) standardised network services catalog that’s completely independent of vendor offerings
- We achieve a more standardised set of telemetry data coming from the network
- In turn, this drives a more standardised and minimal set of service-impact and root-cause analyses
- We design data input/output methods and interfaces (to the black box and to any of its constituent components) to have closed-loop immediacy in mind. At the moment we tend to have interfaces that allow us to interrogate the network and push changes into the network separately rather than tasking the network to keep itself within expected operational thresholds
- We allow networks to self-regulate and self-heal, not just within a node, but between neighbours without necessarily having to revert to centralised control mechanisms like OSS
- All components within the black-box, down to device level, are programmable. [As an aside, we need to consider how to make the physical network more programmable or reconcilable, considering that cables, (most) patch panels, joints, etc don’t have APIs. That’s why the physical network tends to give us the biggest data quality challenges, which ripples out into our ability to automate networks]
- End-to-end data flows (ie controls) are to be near-real-time, not constrained by processing lags (eg 15 minute poll cycles, hourly log processing cycles, etc)
- Data minimalism engineering. It’s currently not uncommon for network devices to produce dozens, if not hundreds, of different metrics. Most are never used by operators manually, nor are likely to be used by learning machines. This increases data processing, distribution and storage overheads. If we only produce what is useful, then it should improve data flow times (point 5 above). Therefore learning machines should be able to control which data sets they need from network devices and at what cadence. The learning engine can start off collecting all metrics, then progressively turning them off as they deem metrics unnecessary. This could also extend to controlling log-levels (ie how much granularity of data is generated for a particular log, event, performance counter)
- Perhaps we even offer AI-as-a-service, whereby any of the components within the black-box can call upon a centralised AI service (and the common data lake that underpins it) to assist with localised self-healing, self-regulation, etc. This facilitates closed-loop decisions throughout the stack rather than just an over-arching command and control mechanism
I’m barely exposing the tip of the iceberg here. I’d love to get your thoughts on what else it will take to bring fully autonomous network to reality.