When it comes to identifying root-cause (ie identifying the actual thing that’s broken / degraded rather than all of the other things that are affected downstream), I tend to think of proximity:
- Proximity in topology (ie nearest neighbours)
- Proximity by geography
- Proximity in time
- Proximity by object hierarchy (think OSI stack)
When devices generate alarms / logs, they don’t tend to have much in the way of proximal information though. The proximal information tends to arrive via an enrichment process by our OSS (or maybe the NMS that sits between device and OSS).
The use of topology to assist with root-cause calculation is commonplace, so I won’t go into it here. You’ve probably also seen alarm states visualised as map overlays, so there’s nothing unusual here. However, we will take a closer look at the last two items in the list above.
Alarms / logs are timestamped, but time proximity is only achieved when viewed relative to the timestamps of other alarms / logs. The human brain can easily process proximity in time, but only if we provide suitable visualisation. Sequencing by timestamp is easy enough, but I’m a little surprised that our tools don’t make more use of sliders that allow us to readily scrub backwards and forwards in time (on historical events, or perhaps even projected future events). Perhaps long poll cycles (ie the time interval between requesting information from a device) can cloud the effectiveness of time proximity.
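To make "time proximity relative to other alarms" a little more concrete, here is a minimal sketch that groups alarms into bursts wherever consecutive timestamps fall within a chosen gap. The alarm dictionaries, the `group_by_time_proximity` name and the 30 second `max_gap` are all assumptions for illustration, not something taken from a particular OSS. As noted above, long poll cycles would blur these groupings.

```python
from datetime import datetime, timedelta

def group_by_time_proximity(alarms, max_gap=timedelta(seconds=30)):
    """Group alarms into bursts: consecutive alarms no more than max_gap apart."""
    ordered = sorted(alarms, key=lambda a: a["time"])
    bursts = []
    for alarm in ordered:
        # Extend the current burst if this alarm is close to the previous one.
        if bursts and alarm["time"] - bursts[-1][-1]["time"] <= max_gap:
            bursts[-1].append(alarm)
        else:
            bursts.append([alarm])
    return bursts

# Hypothetical sample data: three alarms close together, one much later.
t0 = datetime(2024, 1, 1, 12, 0, 0)
alarms = [
    {"id": "A3", "time": t0 + timedelta(seconds=12)},
    {"id": "A1", "time": t0},
    {"id": "A2", "time": t0 + timedelta(seconds=5)},
    {"id": "B1", "time": t0 + timedelta(minutes=10)},
]

bursts = group_by_time_proximity(alarms)
for burst in bursts:
    print([a["id"] for a in burst])
# ['A1', 'A2', 'A3']
# ['B1']
```

The first alarm in each burst is only a candidate for root-cause. Time proximity narrows the search, but it probably needs to be combined with one of the other proximity types before anything is suppressed.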
Nonetheless, time-scrubbers do increase the power of topology / geo views of alarm data too. They allow us to more readily see the ripple-out effect and hence deduce roughly where the event occurred (like where a rock landed in a pond based on where the concentric rings are emanating from).
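The ripple-out idea can be sketched as a scrubber over a topology: for a given slider position, show which nodes are already in alarm, and compare that with hop distance from the earliest-alarming node. Everything here is hypothetical (the five-node topology, the `first_alarm_at` timings, the function names), and treating the earliest alarm as the origin is just a heuristic that poll-cycle jitter could easily upset.

```python
from collections import deque

def hops_from(topology, source):
    """Breadth-first hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def scrub(first_alarm_at, t):
    """Nodes already in alarm at time t (the scrubber position)."""
    return {node for node, ts in first_alarm_at.items() if ts <= t}

# Hypothetical adjacency list and first-alarm times (seconds from start).
topology = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B"],
    "E": ["C"],
}
first_alarm_at = {"C": 0, "B": 2, "E": 3, "A": 5, "D": 6}

# Heuristic: assume the earliest alarm marks where the rock landed.
origin = min(first_alarm_at, key=first_alarm_at.get)
dist = hops_from(topology, origin)

# Three scrubber positions, showing the rings spreading outwards.
frames = [(t, sorted(scrub(first_alarm_at, t))) for t in (0, 3, 6)]
for t, nodes in frames:
    print(t, [(n, dist[n]) for n in nodes])
# 0 [('C', 0)]
# 3 [('B', 1), ('C', 0), ('E', 1)]
# 6 [('A', 2), ('B', 1), ('C', 0), ('D', 2), ('E', 1)]
```

In this made-up data the alarms appear in order of hop distance from C, which is the pattern a time-scrubber makes visible on a topology or geo view.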
Object hierarchy is another proximity technique that doesn’t tend to be used very often, mainly because Fault Management tools don’t tend to store that information. For example, if a cable has been cut (layer 1 in OSI), then it’s common for child alarms to come from higher layers (eg data link, network, transport, session). RCA (Root Cause Analysis) rules can easily identify the root-cause (cable cut), then correlate and suppress the higher layer alarms… but only if they have a reliable object hierarchy to refer to. Our Inventory (LNI / PNI) solutions *should* be able to store object hierarchies.
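As a rough illustration of the cable-cut example, the sketch below walks a child-to-parent hierarchy and attributes each alarm to the lowest alarmed object it rides on. The object names, the `parent_of` mapping and the `correlate` function are hypothetical. In practice the hierarchy would presumably come from the inventory (LNI / PNI) rather than being hard-coded.

```python
# Hypothetical object hierarchy: child object -> the object it rides on.
parent_of = {
    "eth-link-1": "fibre-1",    # data link on physical
    "ip-if-1": "eth-link-1",    # network on data link
    "tcp-sess-1": "ip-if-1",    # transport on network
    "eth-link-9": "fibre-9",
    "ip-if-9": "eth-link-9",
}

def ancestors(obj, parent_of):
    """Yield each object beneath obj in the stack, nearest first."""
    seen = set()  # guards against accidental loops in the hierarchy data
    while obj in parent_of and obj not in seen:
        seen.add(obj)
        obj = parent_of[obj]
        yield obj

def correlate(alarmed, parent_of):
    """Map each alarmed object to its lowest alarmed ancestor (or itself)."""
    alarmed = set(alarmed)
    roots = {}
    for obj in alarmed:
        root = obj
        for anc in ancestors(obj, parent_of):
            if anc in alarmed:
                root = anc  # keep descending towards layer 1
        roots[obj] = root
    return roots

# A cable cut on fibre-1, plus an unrelated alarm on ip-if-9.
alarmed = ["fibre-1", "eth-link-1", "ip-if-1", "tcp-sess-1", "ip-if-9"]
roots = correlate(alarmed, parent_of)

root_causes = sorted(set(roots.values()))
suppressed = sorted(obj for obj, root in roots.items() if obj != root)
print(root_causes)  # ['fibre-1', 'ip-if-9']
print(suppressed)   # ['eth-link-1', 'ip-if-1', 'tcp-sess-1']
```

Note that `ip-if-9` stays as its own root-cause because nothing beneath it is alarmed. This sketch ignores time entirely, so a real rule would probably also require the child and parent alarms to be close in time before suppressing anything.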
It’s interesting though. I’m hearing that our industry is having trouble identifying root cause between the layers in our modern virtualised networks like 5G. As I indicated in this post about a 5G inventory prototype, we’d probably never store applications, VNFs, NFVI, VIM, VNFM, NFVO, etc as layers in our inventory solution… unless it can actually add value to root-cause by object hierarchy proximity… hmmm… I wonder?
I’d love to hear your thoughts on this one. Leave us a comment below.
BTW. If RCA interests you, you might like to take a look at this old post that describes the steps for building up a systematic RCA pipeline.