The challenging thing about establishing root-cause is that the rules tend to be specific to each network. Vendors, topologies, interface specs, etc tend to differ considerably, so the rules need to be customised for each network.
But there are a few rules that can be applied to any network. Yesterday we described a root-cause algorithm called Root Cause Trace (RCT). Today we’ll look at Root Cause by Hierarchy (RCH).
When the cause and effects of a fault sit within a single domain, it tends to be easier to resolve than when they span domains. RCT and RCH are examples of cross-domain RCA (Root Cause Analysis) techniques.
The diagram below shows two halves of a network sub-section. The upper half shows the physical connectivity (with “circuits” overlaid as dotted lines).
The bottom half shows how this connectivity can be drawn as network layers. These aren’t exactly OSI layers, although there are some parallels. How closely they map to OSI depends on how you’ve structured your object / data hierarchy in your inventory (LNI / PNI tool).
The concept behind RCH is that if you have an alarm on one of the lower layers in the hierarchy, then it is treated as the likely root-cause, and all related alarms from upper layers can be associated with it / suppressed.
For example, if there are Loss of Signal alarms on the two Line Ports on the ADMs (SDH Add Drop Multiplexers), then it’s likely to be a break in the physical path that the Digital Link traverses [Note that you could apply the RCT rule from yesterday’s post to determine which patchlead, cable, joint or ODF is the likely culprit].
Therefore, any alarms coming from the Tributary Ports on the ADM, or any alarms emanating from the VNFs (ie SPF and UPF), are likely to be a result of the physical path break, because those objects sit in higher layers of the hierarchy.
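To make the ADM example a little more concrete, here is a minimal Python sketch of the RCH idea. It assumes the inventory can be exported as a simple "depends on" mapping from each object to the lower-layer objects beneath it. The object names, the mapping and the function names are all hypothetical, and a real LNI / PNI data model will be far richer than this.

```python
def transitive_deps(name, inventory):
    """Return every lower-layer object that `name` depends on, directly or indirectly."""
    seen = set()
    stack = list(inventory.get(name, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:          # guards against loops in the inventory data
            seen.add(dep)
            stack.extend(inventory.get(dep, ()))
    return seen


def root_cause_by_hierarchy(inventory, alarmed):
    """Split alarmed objects into likely root-causes and suppressed symptoms.

    An alarmed object is treated as a root-cause if nothing beneath it in the
    hierarchy is also alarmed. Every other alarm is associated with the
    root-cause(s) found beneath it.
    """
    alarmed = set(alarmed)
    below = {a: transitive_deps(a, inventory) & alarmed for a in alarmed}
    roots = {a for a, deps in below.items() if not deps}
    suppressed = {a: sorted(deps & roots) for a, deps in below.items() if deps}
    return roots, suppressed


# Hypothetical hierarchy, loosely based on the example (upper layer -> lower layer).
INVENTORY = {
    "ADM-A line port": [],
    "ADM-B line port": [],
    "ADM-A trib port": ["ADM-A line port"],
    "ADM-B trib port": ["ADM-B line port"],
    "SPF": ["ADM-A trib port"],
    "UPF": ["ADM-B trib port"],
}

# Assume every object in the hierarchy is currently raising an alarm.
roots, suppressed = root_cause_by_hierarchy(INVENTORY, INVENTORY.keys())
print("Likely root-causes:", sorted(roots))
for symptom, causes in sorted(suppressed.items()):
    print(f"{symptom} -> associated with {causes}")
```

With these assumed inputs, the two Line Port alarms are kept as likely root-causes, and the Tributary Port and VNF alarms are associated with the Line Port beneath them. In practice you would probably also filter on alarm type and time window before suppressing anything.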
You’ll notice that I’ve also shown the OSP (Outside Plant) Containment layers. They don’t necessarily have a direct impact on the example above. However, if the scenario was a backhoe cutting through a duct, multiple subducts and multiple cables, then the resulting alarm storm would extend far beyond the infrastructure shown in the diagram above. In that case, the damaged duct is the real root-cause, because the damage to the sub-ducts and cables it contains follows from it.
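The containment case can be sketched in a similar way. The snippet below assumes a hypothetical "contained in" mapping (cable to sub-duct to duct) and looks for the most specific container shared by all of the affected cables. That shared container is only a candidate for where to send the field crew, not a confirmed root-cause, and the names used here are purely illustrative.

```python
def containment_chain(obj, contained_in):
    """Return [obj, its container, that container's container, ...]."""
    chain, seen = [], set()
    while obj is not None and obj not in seen:   # `seen` guards against bad data loops
        chain.append(obj)
        seen.add(obj)
        obj = contained_in.get(obj)
    return chain


def common_container(affected, contained_in):
    """Return the most specific object that contains every affected object."""
    chains = [containment_chain(obj, contained_in) for obj in affected]
    if not chains:
        return None
    shared = set(chains[0]).intersection(*chains[1:])
    # Walk outward from the first object so the innermost shared container wins.
    for candidate in chains[0]:
        if candidate in shared:
            return candidate
    return None


# Hypothetical OSP containment: cable -> sub-duct -> duct.
CONTAINED_IN = {
    "cable-1": "subduct-1",
    "cable-2": "subduct-1",
    "cable-3": "subduct-2",
    "subduct-1": "duct-1",
    "subduct-2": "duct-1",
}

print(common_container(["cable-1", "cable-2", "cable-3"], CONTAINED_IN))
print(common_container(["cable-1", "cable-2"], CONTAINED_IN))
```

With cables in both sub-ducts affected, the shared container is the duct. With only the two cables in the first sub-duct affected, the search stops at that sub-duct instead.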
Note: For simplicity, I’ve excluded other layers of containment (eg buildings, pits, poles, towers, etc) from the diagram. I’ve also simplified the network by leaving out an SDH ring (with protection), intermediate routing points, etc. However, the RCH concept becomes even more helpful in those more complex, multi-layer, cross-domain network scenarios.
You may also recall that in the earlier article, “Proximity and Root Cause”, we talked about how RCH could help to resolve some complex alarms between layers in virtualised networks like 5G and SDN. If NFVI, VIM, VNFM, NFVO and EMS/NMS all store information separately with no way of correlating between layers, then hierarchical data from LNI / PNI, combined with Root Cause by Hierarchy analysis, could come in handy.