The cumulative benefits of AIOps

As you will have seen in recent articles, we’re due to release our latest report in a couple of weeks. It’s called, “AIOps of the Future.”

In conversations with various people while preparing this report, we’ve noticed that some believe they will get a silver bullet that resolves most network events automatically, straight out of the box.

Whilst these solutions certainly do deliver some benefits out of the box once trained on the data baseline (eg reducing incident noise, auto-ticketing, identification of event patterns, anomaly detection, etc), there is still a lot of work to be done by operations teams.

Many of you will have seen this graph below before. We use it somewhat generically to represent a lot of different long-tail situations. In today’s case, we’ll use it to outline the cumulative benefits of effort applied to AIOps implementations.

Let’s assume this graph represents groupings of similar event patterns along the X-axis (as identified by AIOps classification Machine Learning) and the Y-axis identifies the number of instances of each pattern.

It therefore makes sense that we seek to prioritise the automation of each pattern starting from the left of the graph and working out towards the right.

If we look to automate 5 patterns per month (and assume the rules automate every instance of a pattern), then we see a cumulative benefit graph that looks like the following. As you’ll notice, “the machine” is progressively handling more and more of the events automatically. Naturally, fewer and fewer events need to be handled by humans (ie the gap between the top of the blue bar and the orange line of total instances). The orange line also represents the asymptote towards which the automations track. Whether it’s possible to automate everything in the long tail above remains to be seen.
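To make the arithmetic behind the cumulative graph concrete, here is a minimal sketch in Python. It assumes, as the article does, that each automation rule handles every instance of its pattern and that volumes stay constant. The function name `cumulative_coverage` and the 1000 / rank volumes are purely illustrative and are not taken from the report.

```python
from itertools import accumulate


def cumulative_coverage(pattern_counts, patterns_per_month=5):
    """Return the number of event instances handled automatically after each month.

    Patterns are automated largest-volume first, `patterns_per_month` at a time.
    """
    ordered = sorted(pattern_counts, reverse=True)  # left-to-right on the long tail
    running = list(accumulate(ordered))             # running total of automated instances
    coverage = []
    for end in range(patterns_per_month, len(ordered) + patterns_per_month, patterns_per_month):
        last_index = min(end, len(ordered)) - 1     # the final month may be a partial batch
        coverage.append(running[last_index])
    return coverage


# Hypothetical long tail: 20 patterns whose volumes fall off as 1000 / rank.
counts = [1000 // rank for rank in range(1, 21)]
total = sum(counts)  # the "orange line" of total instances

for month, automated in enumerate(cumulative_coverage(counts), start=1):
    manual = total - automated  # what humans still handle
    print(f"month {month}: automated={automated} manual={manual} share={automated / total:.0%}")
```

Each month adds a smaller slice than the one before, which is why the automated share flattens out as it approaches the total.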

A couple of extra notes about this highly conceptualised example:

Rather than prioritising the largest instance volumes (ie working left-to-right in the long-tail diagram), you might instead wish to focus on the instances that generate Sev1 or Sev2 outages (eg as marked in the red bars on the long-tail). There are various other prioritisation approaches mentioned in our report BTW.
For simplicity, the lower graph assumes the exact same instance numbers recur over time. In reality, the instance numbers are constantly changing as new instances arise.
As we progress further to the right of the long-tail, we may get to a point where we decide that the cost of automation outweighs the benefits.
As the lower graph moves from left-to-right over time and gets ever closer to the asymptote of complete automation, there will be many changes occurring in the Ops model: technical, cultural, behavioural, etc. Our report also talks about the implications of this.
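Two of the notes above (severity-first prioritisation and the cost/benefit cut-off) can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than the method in the report: the `Pattern` fields, the `value_per_instance` and `horizon_months` parameters and the sample figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str
    instances_per_month: int
    worst_severity: int      # 1 = Sev1 (most severe); hypothetical field
    automation_cost: float   # one-off effort to build the rule, in arbitrary units


def prioritise(patterns, value_per_instance=1.0, horizon_months=12):
    """Drop patterns that cost more to automate than they save, then rank the rest.

    Patterns that have caused Sev1/Sev2 outages come first; within each group,
    higher-volume patterns come first.
    """
    worthwhile = [
        p for p in patterns
        if p.instances_per_month * value_per_instance * horizon_months >= p.automation_cost
    ]
    return sorted(worthwhile, key=lambda p: (p.worst_severity > 2, -p.instances_per_month))


# Hypothetical sample data.
backlog = [
    Pattern("link-flap", instances_per_month=1000, worst_severity=4, automation_cost=100),
    Pattern("core-router-down", instances_per_month=50, worst_severity=1, automation_cost=100),
    Pattern("rare-config-drift", instances_per_month=2, worst_severity=3, automation_cost=500),
]

for p in prioritise(backlog):
    print(p.name)
```

With these sample numbers the low-volume, high-cost pattern is dropped, and the Sev1 pattern is ranked ahead of the much noisier Sev4 one.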

September 29, 2023
Ryan
