As you will have seen in recent articles, we’re due to release our latest report in a couple of weeks. It’s called “AIOps of the Future.”
While talking with various people during the preparation of this report, we’ve noticed that some believe an AIOps solution will be a silver bullet that resolves most network events automatically, straight out of the box.
Whilst AIOps solutions certainly do deliver some benefits out of the box once they have been trained on a baseline of data (eg reducing incident noise, auto-ticketing, identification of event patterns, anomaly detection, etc), there is still a lot of work to be done by operations teams.
Many of you will have seen the graph below before. We use it somewhat generically to represent a lot of different long-tail situations. In today’s case, we’ll use it to outline the cumulative benefits of effort applied to AIOps implementations.
Let’s assume the X-axis of this graph represents groupings of similar event patterns (as identified by the AIOps tool’s machine-learning classification) and the Y-axis shows the number of instances of each pattern.
It therefore makes sense that we seek to prioritise the automation of each pattern starting from the left of the graph and working out towards the right.
If we automate 5 patterns per month (and assume the rules automate every instance of a pattern), then we see a cumulative benefit graph that looks like the following. As you’ll notice, “the machine” progressively handles more and more of the events automatically. Naturally, fewer and fewer events need to be handled by humans (ie the gap between the top of the blue bar and the orange line of total instances). The orange line also represents the asymptote towards which the automations track. Whether it’s possible to automate everything in the long tail remains to be seen.
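To make the arithmetic behind that cumulative graph concrete, here is a minimal sketch in Python. The instance counts are invented purely for illustration (a rough long-tail shape), and the function name `cumulative_automation` is our own, not something from any AIOps product. It assumes, as above, that 5 patterns are automated per month and that each rule handles every instance of its pattern.

```python
from itertools import accumulate


def cumulative_automation(pattern_counts, patterns_per_month=5):
    """Return, month by month, how many event instances are handled automatically.

    Patterns are automated in descending order of instance count
    (ie left-to-right along the long tail).
    """
    ranked = sorted(pattern_counts, reverse=True)
    running = list(accumulate(ranked))  # running[i] = instances covered by the top i+1 patterns
    handled = []
    for end in range(patterns_per_month, len(ranked) + patterns_per_month, patterns_per_month):
        idx = min(end, len(ranked)) - 1  # the last month may automate fewer than a full batch
        handled.append(running[idx])
    return handled


# Hypothetical long-tail data: 20 patterns with rapidly falling instance counts.
counts = [1000 // rank for rank in range(1, 21)]
total = sum(counts)  # the "orange line" of total instances
handled = cumulative_automation(counts)

for month, automated in enumerate(handled, start=1):
    print(f"Month {month}: automated={automated}, left for humans={total - automated}")
```

Each month’s gain is smaller than the last because the remaining patterns sit further out in the tail, which is why the automated volume flattens out as it approaches the total.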
A couple of extra notes about this highly conceptualised example:
- Rather than prioritising the largest instance volumes (ie working left-to-right along the long-tail diagram), you might instead wish to focus on the instances that generate Sev1 or Sev2 outages (eg as marked by the red bars on the long-tail). Our report mentions various other prioritisation approaches too.
- For simplicity, the lower graph assumes the exact same instance numbers recur over time. In reality, the instance numbers are constantly changing as new instances arise.
- As we progress further to the right of the long-tail, we may get to a point where we decide that the cost of automation outweighs the benefits.
- As the lower graph moves from left-to-right over time and gets ever closer to the asymptote of complete automation, there will be many changes occurring in the Ops model – technical, cultural, behavioural, etc. Our report also talks about the implications of this.
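
Two of the notes above, severity-first prioritisation and the point where cost outweighs benefit, can be sketched as simple rules. This is only an illustration under our own assumptions: the `Pattern` fields, the pattern names, the cost and saving figures and the 12-period payback window are all hypothetical, and real prioritisation would weigh more factors than these.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str
    instances: int               # recurring instances per period (assumed constant)
    worst_severity: int          # 1 = Sev1 (most severe), larger = less severe
    automation_cost: float       # hypothetical one-off cost to build the automation
    saving_per_instance: float   # hypothetical saving each time an instance is automated


def prioritise(patterns, severity_first=False):
    """Order patterns by volume, or by worst severity and then volume."""
    if severity_first:
        return sorted(patterns, key=lambda p: (p.worst_severity, -p.instances))
    return sorted(patterns, key=lambda p: -p.instances)


def worth_automating(pattern, periods=12):
    """True if the savings over the payback window exceed the automation cost."""
    benefit = pattern.instances * pattern.saving_per_instance * periods
    return benefit > pattern.automation_cost


# Invented example data.
patterns = [
    Pattern("link-flap", 900, 3, 5000.0, 2.0),
    Pattern("core-router-down", 12, 1, 8000.0, 400.0),
    Pattern("fan-alarm", 40, 4, 6000.0, 1.0),
]

by_volume = [p.name for p in prioritise(patterns)]
by_severity = [p.name for p in prioritise(patterns, severity_first=True)]
candidates = [p.name for p in patterns if worth_automating(p)]

print(by_volume)
print(by_severity)
print(candidates)
```

In this made-up data the low-volume Sev1 pattern jumps to the front under severity-first ordering, while the rare, low-value pattern in the tail never repays its automation cost.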