It’s been nearly two decades since I designed my first root-cause analysis (RCA) rule. It was completely reliant on network topology – more specifically, it relied on a network hierarchy to determine which alarms could be suppressed.
I had a really interesting discussion today with some colleagues who are using much more modern RCA techniques. I was somewhat surprised, but not surprised at all in hindsight, that their Machine Learning engine doesn’t even use topology data. It just looks at events and tries to identify patterns.
That’s a really interesting insight that hadn’t dawned on me before. But it’s an exciting one because it effectively unshackles our fault management tools from data quality perfection in our inventory / asset databases. It also possibly lessens the need for integrations that share topological data.
Equally interesting, the ML engine had identified over 4,000 patterns, but only a dozen had been codified and put into use so far. In other words, the machine was learning, but humans still needed to get involved in the process to confirm that the machine had learned correctly.
Makes me wonder whether the ML pre-seeding technique we discussed in an earlier post might actually be useful for confirmations at a greater scale than the team had achieved with 12 of 4000+ to date.
The standard approach is to let the ML loose and identify patterns. This is the reactive approach. The ML reacts to the alarms that are pushed up from the network. It looks at alarms and determines what the root cause is based on historical data. A human then has to check that the root cause is correct by reverse engineering the alarm stream (just like a network operator used to do before RCA tools came along) and comparing. If the comparison is successful, the person then approves this pattern.
My proposed alternate approach is the proactive method. If we proactively trigger a fault (e.g. pull a patch lead, take a port down, etc), we start from a position of already knowing what the root cause is. This has three benefits:
1. We can check if the ML’s interpretation of root cause is right
2. We’ve proactively seeded the ML’s data with this root cause example
3. We categorically know what the root cause is, unlike the reactive mode which only assumes the operator has correctly diagnosed the root cause
Then we just have to figure out a whole bunch of proactive failures to test safely. Where to start? Well, I’d speak with the NOC operators to find out what their most common root causes are and look to trigger those events.
More tomorrow on intentionally triggering failures in production systems.Read the Passionate About OSS Blog for more or Subscribe to the Passionate About OSS Blog by Email