Root cause analysis (RCA) is one of the great challenges of OSS. As you know, it aims to identify the probable cause of an alarm storm, in which all of the alarms are actually related to a single fault.
In the past, my go-to approach was to start with a circuit hierarchy-based algorithm. If you had an awareness of the hierarchy of circuits, usually from inventory, then a lower-order fault (eg Loss of Signal on a transmission link caused by a cable break) allowed you to suppress all higher-order alarms (ie from the bearers or tributaries that were dependent upon that L1 link). That worked well in the fixed networks of the distant past (think SDH / PDH), not least because the approach was repeatable between different customer environments.
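A minimal sketch of that suppression logic, assuming inventory can give you a child-to-parent map of circuits (the circuit names and the PARENT structure below are purely illustrative, not any particular OSS's data model):

```python
# Hypothetical inventory extract: child circuit -> the lower-order circuit it rides on.
# e.g. a customer E1 rides on a VC-12, which rides on an STM-1 section.
PARENT = {
    "VC12-0041": "STM1-SECTION-7",
    "E1-CUSTOMER-A": "VC12-0041",
    "E1-CUSTOMER-B": "VC12-0041",
}

def root_alarms(alarmed):
    """Keep only alarms on circuits with no alarmed ancestor.

    'alarmed' is the set of circuit ids currently in alarm. Any circuit that
    depends (directly or transitively) on another alarmed circuit is treated
    as a sympathetic alarm and suppressed.
    """
    roots = set()
    for circuit in alarmed:
        ancestor = PARENT.get(circuit)
        while ancestor is not None and ancestor not in alarmed:
            ancestor = PARENT.get(ancestor)
        if ancestor is None:          # no alarmed ancestor found
            roots.add(circuit)
    return roots

# A cable break raises Loss of Signal on the section plus sympathetic alarms
# on everything carried over it; only the section survives suppression.
print(root_alarms({"STM1-SECTION-7", "VC12-0041", "E1-CUSTOMER-A"}))
# -> {'STM1-SECTION-7'}
```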
Packet-switched data networks changed that to an extent, because a data service could traverse any number of links, on-net or off-net (ie leased links). The circuit hierarchy approach was still applicable, but it needed to be supplemented with other rules.
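One such supplementary rule, sketched here under the assumption that inventory (or a path computation source) can tell you the set of links each service traverses; the service and link names are made up for illustration:

```python
# Hypothetical service-to-path mapping, including off-net (leased) segments
# for which we only ever see the downstream symptoms.
SERVICE_PATHS = {
    "VPN-ACME-SYD-MEL": ["LINK-SYD-CBR", "LEASED-CBR-ALB", "LINK-ALB-MEL"],
    "VPN-ACME-SYD-BNE": ["LINK-SYD-NWC", "LINK-NWC-BNE"],
}

def services_impacted_by(failed_link):
    """Correlate a link alarm to every service whose path traverses that link."""
    return sorted(svc for svc, path in SERVICE_PATHS.items() if failed_link in path)

print(services_impacted_by("LINK-SYD-CBR"))   # -> ['VPN-ACME-SYD-MEL']
```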
Now virtualised networking is changing it again. RCA loses a little relevance in the virtualised layer. Workloads and resource allocations are dynamic and transient, making them less suited to fixed algorithms. The objective now becomes self-healing: if a failure is identified, failed resources are spun down and new ones spun up to take the load. The circuit hierarchy approach loses relevance, but perhaps infrastructure hierarchy still remains useful. Cable breaks, server melt-downs and hanging controller applications are all examples of root causes that will cause problems in higher layers.
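That reconciliation idea looks something like the loop below. The orchestrator client and its list_instances / is_healthy / terminate / launch calls are placeholders for whatever VIM, VNF manager or container platform is actually in play, not a real API:

```python
import time

def self_heal(orchestrator, desired_instances=3, poll_seconds=30):
    """Toy self-healing loop: replace failed workloads rather than diagnose them."""
    while True:
        instances = orchestrator.list_instances()
        healthy = [i for i in instances if orchestrator.is_healthy(i)]

        # Spin down anything that failed its health check...
        for failed in set(instances) - set(healthy):
            orchestrator.terminate(failed)

        # ...and spin up replacements until we are back at the desired count.
        for _ in range(desired_instances - len(healthy)):
            orchestrator.launch()

        time.sleep(poll_seconds)
```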
Rather than fixed rules, machine-based pattern matching is the next big hope for coping with dynamically changing networks.
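As a deliberately crude stand-in for that idea (no trained model here, just a heuristic): group alarms that arrive close together in time, then let the earliest, lowest-layer member of each group nominate itself as the probable cause. The alarm fields and the 60-second window are assumptions for the sake of the sketch:

```python
# (timestamp_seconds, layer, source) -- lower layer = closer to the physical root.
alarms = [
    (1000, 1, "fibre SYD-CBR"),      # cable break
    (1002, 2, "MPLS LSP 4711"),
    (1003, 3, "VPN ACME"),
    (1900, 3, "VNF firewall-7"),     # arrives much later, unrelated burst
]

def bursts(alarms, window=60):
    """Split time-ordered alarms into bursts; a gap longer than 'window' starts a new one."""
    grouped, current = [], []
    for alarm in sorted(alarms):
        if current and alarm[0] - current[-1][0] > window:
            grouped.append(current)
            current = []
        current.append(alarm)
    if current:
        grouped.append(current)
    return grouped

for burst in bursts(alarms):
    probable_cause = min(burst, key=lambda a: (a[1], a[0]))   # lowest layer, then earliest
    print(f"{probable_cause[2]} <- probable cause of {len(burst)} alarm(s)")
```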
The number of layers and the complexity of the network seem to be ever increasing, and with them RCA has to become ever more sophisticated… If only we could evolve to simpler networks rather than more complex ones. Wishful thinking?