Proximity and Root-Cause

When it comes to identifying root-cause (ie identifying the actual thing that’s broken / degraded rather than all of the other things that are affected downstream), I tend to think in terms of proximities / adjacencies:

  1. Proximity in topology (ie nearest neighbours)
  2. Proximity by geography
  3. Proximity in time (temporal proximity and seasonality)
  4. Proximity by object hierarchy (think OSI stack, particularly for cross-domain entities)
  5. Proximity by context

When alarms / logs / events arrive from devices, they don’t tend to provide much in the way of proximal information though. For example, they will almost never have geo-coordinates of where the event occurred. The proximal information we’ll talk about below tends to become available only via an enrichment process by our OSS (or maybe the NMS that sits between device and OSS).

The use of topology proximity to assist with root-cause calculation is common with existing monitoring and AIOps tools, so I won’t go into it in much detail here. Think of this as a contagion effect, where outages can ripple out to nearest neighbours they’re connected to in a network. You’ve probably also seen alarm states visualised as map overlays, with clusters of red showing the contagion effect, so you’re sure to be familiar with this one.
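
To make that concrete, here’s a minimal sketch (in Python, and not modelled on any particular monitoring tool) of how topology proximity might be applied: alarms whose source devices sit within a small number of hops of a suspected root-cause device are treated as candidate symptoms. The topology map, alarm structure and hop threshold are all illustrative assumptions.

```python
from collections import deque

# Hypothetical adjacency list of the network topology (device -> neighbours)
TOPOLOGY = {
    "core-1": ["agg-1", "agg-2"],
    "agg-1": ["core-1", "access-1", "access-2"],
    "agg-2": ["core-1", "access-3"],
    "access-1": ["agg-1"],
    "access-2": ["agg-1"],
    "access-3": ["agg-2"],
}

def hop_distance(src, dst, topology=TOPOLOGY):
    """Breadth-first search returning the hop count between two devices (None if unreachable)."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in topology.get(node, []):
            if neighbour == dst:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def within_hops(root, device, max_hops=1):
    """True if `device` is within `max_hops` of the suspected root-cause device."""
    d = hop_distance(root, device)
    return d is not None and d <= max_hops

# Alarms close to the suspected root in the topology become candidate symptoms
suspected_root = "agg-1"
alarms = [{"device": "access-1"}, {"device": "access-3"}, {"device": "core-1"}]
candidates = [a for a in alarms if within_hops(suspected_root, a["device"], max_hops=1)]
print(candidates)  # access-1 and core-1 are 1 hop away; access-3 is not
```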

However, we will take a closer look at the other proximities below.

Alarms / logs are timestamped, but time proximity is only achieved when those timestamps are viewed relative to the timestamps of other alarms / logs / events (including those from adjacent devices). The human brain can easily process proximity in time, but only if we provide suitable visualisation and referencing. Sequencing by timestamp is easy enough, but I’m a little surprised that our tools don’t make more use of sliders that allow us to readily scrub backwards and forwards in time (on historical events, or perhaps even forwards against projected future events such as threshold crossing alarms). One of the challenges we face is that long poll + processing cycles (ie the interval between when an event actually happens and when it becomes visible in monitoring systems) can cloud the effectiveness of time proximity. Time proximity doesn’t just consider an immediate time window (eg +/- 5 minutes) either; it can also consider seasonality (time of day / week / month / year, or recurring events) drawn from historical trend data.
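
As a rough sketch only (assumed alarm fields and an arbitrary five-minute window, not any product’s actual correlation logic), time proximity can be as simple as sorting alarms by timestamp and grouping those that arrive within a configurable window of each other:

```python
from datetime import datetime, timedelta

# Hypothetical alarms, already enriched with parsed timestamps
alarms = [
    {"id": "A1", "ts": datetime(2024, 3, 1, 10, 0, 5)},
    {"id": "A2", "ts": datetime(2024, 3, 1, 10, 1, 40)},
    {"id": "A3", "ts": datetime(2024, 3, 1, 10, 58, 0)},
]

def cluster_by_time(alarms, window=timedelta(minutes=5)):
    """Group alarms whose timestamps fall within `window` of the previous alarm."""
    clusters, current = [], []
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        if current and alarm["ts"] - current[-1]["ts"] > window:
            clusters.append(current)
            current = []
        current.append(alarm)
    if current:
        clusters.append(current)
    return clusters

for cluster in cluster_by_time(alarms):
    print([a["id"] for a in cluster])
# -> ['A1', 'A2'] then ['A3']
```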

Time-scrubbers also have the potential to increase the power of topology / geo-proximity visualisations of alarm / event data. They allow us to more readily see the ripple-out effect and hence deduce roughly where the event occurred (like working out where a rock landed in a pond based on where the concentric rings are emanating from). In addition to awareness of spatial “closeness” of network assets, geo-proximity can also include non-traditional data sets such as weather events (lightning strikes, rainfall, cyclones, etc), power outage maps, radio coverage maps, etc.
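
A similarly hedged sketch for geo-proximity: once alarms have been enriched with site coordinates, they can be scored against external data sets such as a lightning-strike feed using a simple great-circle distance. The coordinates, the 10 km correlation radius and the feed itself are all made-up placeholders.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical enrichment: device site coordinates and an external weather-event feed
sites = {"access-1": (-33.87, 151.21), "access-3": (-37.81, 144.96)}
lightning_strike = (-33.90, 151.25)  # made-up strike location

for device, (lat, lon) in sites.items():
    distance = haversine_km(lat, lon, *lightning_strike)
    if distance < 10:  # arbitrary 10 km correlation radius
        print(f"{device} is {distance:.1f} km from the strike - plausible cause")
```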

Object hierarchy proximity is another technique that doesn’t tend to be used very often, mainly because Fault Management tools don’t tend to store that information. For example, if a cable has been cut (layer 1 in OSI), then it’s common for child alarms to come from higher layers (eg data link, network, transport, session). RCA (Root Cause Analysis) rules can easily determine the root-cause (cable cut) and correlate / suppress the higher layer alarms… but only if they have a reliable object hierarchy to refer to. Our Inventory (LNI / PNI) solutions *should* be able to store object hierarchies. Our OSS also tend to store customer and service relationships with the resources they use within these object hierarchies. This is used for Service Impact Analysis (SIA), but also has the potential to be useful for operational purposes.
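
The sketch below shows one way an RCA rule might use an object hierarchy pulled from inventory: if a resource lower in the stack (the fibre segment) is alarmed, alarms on anything riding over it are suppressed as symptoms. The hierarchy, resource names and alarm types are purely illustrative.

```python
# Illustrative object hierarchy from an inventory system:
# child resource -> the parent resource it rides on
PARENT_OF = {
    "eth-trail-1": "fibre-segment-7",   # layer-2 trail over a physical fibre
    "ip-link-1": "eth-trail-1",         # layer-3 link over the trail
    "vpn-svc-1": "ip-link-1",           # customer service over the link
}

def ancestors(resource):
    """Walk up the hierarchy to collect all supporting resources."""
    chain = []
    while resource in PARENT_OF:
        resource = PARENT_OF[resource]
        chain.append(resource)
    return chain

alarms = [
    {"resource": "fibre-segment-7", "type": "LOS"},        # the cable cut
    {"resource": "ip-link-1", "type": "LINK_DOWN"},
    {"resource": "vpn-svc-1", "type": "SERVICE_DEGRADED"},
]

alarmed = {a["resource"] for a in alarms}
for alarm in alarms:
    # Suppress an alarm if any resource beneath it in the hierarchy is also alarmed
    if any(parent in alarmed for parent in ancestors(alarm["resource"])):
        alarm["suppressed_by_rca"] = True

print([(a["resource"], a.get("suppressed_by_rca", False)) for a in alarms])
# fibre-segment-7 remains the root-cause; the higher-layer alarms are suppressed
```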

Semantic / contextual proximity requires an understanding of the context behind proximal data points. For example, an alarm indicating a “high CPU usage” event may be semantically close to an alarm for “low available memory” on the same or similar devices, as both could be indicators of resource constraints. This one could also consider other dependencies such as common users / groups, applications, changes, etc.
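
And a minimal sketch of contextual grouping, where raw alarm types are mapped to broader semantic categories so that related symptoms on the same device cluster together. The category mapping here is an invented example, not a standard taxonomy.

```python
# Hypothetical mapping of raw alarm types to a broader semantic category
SEMANTIC_CATEGORY = {
    "HIGH_CPU": "resource-constraint",
    "LOW_MEMORY": "resource-constraint",
    "FAN_FAILURE": "environmental",
    "HIGH_TEMPERATURE": "environmental",
}

alarms = [
    {"device": "core-1", "type": "HIGH_CPU"},
    {"device": "core-1", "type": "LOW_MEMORY"},
    {"device": "agg-2", "type": "FAN_FAILURE"},
]

# Group alarms that share a semantic category on the same device
groups = {}
for alarm in alarms:
    key = (SEMANTIC_CATEGORY.get(alarm["type"], "other"), alarm["device"])
    groups.setdefault(key, []).append(alarm["type"])

for (category, device), types in groups.items():
    print(f"{device}: {category} -> {types}")
```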

All of these forms of proximity can assist in the correlation process used by traditional monitoring or AIOps. 
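
Bringing it together, one simple (and entirely illustrative) way to use these proximities in correlation is to score each pair of alarms on every proximity dimension and combine the scores with weights. In practice the weights and per-dimension scoring would need to be tuned or learned rather than hard-coded as they are here.

```python
# Placeholder weights - in practice these would be tuned or learned
WEIGHTS = {"topology": 0.3, "geo": 0.2, "time": 0.3, "hierarchy": 0.1, "context": 0.1}

def correlation_score(signals, weights=WEIGHTS):
    """Combine per-proximity scores (each 0.0-1.0) into a single weighted score."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

# Example: two alarms that are close in time and topology but not geography
signals = {"topology": 1.0, "time": 0.9, "geo": 0.2, "hierarchy": 0.0, "context": 0.5}
print(f"correlation score: {correlation_score(signals):.2f}")
```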

As an aside, I’m hearing that our industry is currently having trouble identifying root cause between the layers in our modern virtualised networks like 5G. As I indicated in this post about a 5G inventory prototype, we’d probably never store applications, VNFs, NFVI, VIM, VNFM, NFVO, etc as layers in our inventory solution… unless it can actually add value to root-cause by object hierarchy proximity… hmmm… I wonder?

I’d love to hear your thoughts on this one. Leave us a comment below.

BTW. If RCA interests you, you might like to take a look at this old post that describes the steps for building up a systematic RCA pipeline.
