Moving from traditional assurance to AIOps, what are the differences?

We’re going to compare the assurance models of the past with the changing assurance demands appearing today. The diagrams below are highly stylised for discussion purposes, so they’re unlikely to reflect actual implementations exactly, but we’ll get to that.

Old Assurance Architecture
Under the old model, the heart of the OSS/BSS was the database (almost exclusively a relational database). It would gather data, via probes/MDDs/collectors, from the network under management (Note: I’ve shown the sources as devices, but they could equally be EMS/NMS). The mediation device drivers (MDDs) would take feeds from the network and homogenise them so they were suitable for very precise loading into tables in the relational database.
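To make the MDD role concrete, here’s a minimal sketch of that homogenisation step: one vendor-specific trap payload mapped onto a common schema ready for a relational table. The field names, the `SEVERITY_MAP` and the `normalise_trap` function are all hypothetical, purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class NormalisedEvent:
    """A homogenised record, ready for loading into a relational table."""
    device_id: str
    event_type: str
    severity: int
    timestamp: str

# Hypothetical mapping from one vendor's severity strings to a common scale
SEVERITY_MAP = {"critical": 1, "major": 2, "minor": 3, "warning": 4, "info": 5}

def normalise_trap(raw: dict) -> NormalisedEvent:
    """Map a vendor-specific trap payload onto the common schema."""
    return NormalisedEvent(
        device_id=raw["source"],
        event_type=raw["trap_name"],
        severity=SEVERITY_MAP.get(raw.get("severity", "info"), 5),
        timestamp=raw["time"],
    )

row = normalise_trap({"source": "rtr-01", "trap_name": "linkDown",
                      "severity": "major", "time": "2019-05-01T10:00:00Z"})
print(row.severity)  # 2
```

A real mediation layer would have one such mapping per vendor feed, which is exactly why the old model needed so much per-source integration effort.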

This data came in the form of alarms/events, performance counters, syslogs and not much else. It could arrive in all sorts of common (eg SNMP) or obscure forms/protocols. Some came via near-real-time notifications, but a lot was polled at cycles of 5 or 15 minutes.

Then the OSS/BSS applications (eg Assurance, Inventory, etc) would consume data from the database and write other data back to the database. There were some automations, such as hard-coded suppression rules, etc.

The automations were rarely closed-loop (ie to actually resolve the assurance challenge). There were also software assistants such as trendlines and threshold alerts to help capacity planners.
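The hard-coded suppression rules mentioned above were typically simple, deterministic filters. Here’s an illustrative sketch of one such rule (the alarm fields and topology model are invented for the example): if a parent device is down, drop the "unreachable" alarms from everything behind it.

```python
def suppress(alarms):
    """Hard-coded suppression: hide downstream alarms whose parent
    device is already reporting a deviceDown alarm."""
    down = {a["device"] for a in alarms if a["type"] == "deviceDown"}
    return [a for a in alarms
            if a["type"] == "deviceDown" or a["parent"] not in down]

alarms = [
    {"device": "rtr-01", "parent": None, "type": "deviceDown"},
    {"device": "sw-05", "parent": "rtr-01", "type": "unreachable"},  # suppressed
    {"device": "sw-09", "parent": "rtr-02", "type": "unreachable"},  # kept
]
kept = suppress(alarms)
```

Note the rule only reduces noise for an operator; it does nothing to actually fix the fault, which is the closed-loop gap described above.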

There was little overlap into security assurance – occasionally there might have been a notification that a device’s configuration varied from a golden config, or indirect indicators through performance graphs / thresholds.
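That golden-config comparison can be sketched very simply. This is a minimal, assumed representation (configs as key/value dicts) just to show the idea of reporting drift between a golden config and the running config:

```python
def config_drift(golden: dict, running: dict) -> dict:
    """Return the settings where the running config differs from the
    golden config, as {setting: (expected, actual)} pairs."""
    drift = {}
    for key in golden.keys() | running.keys():
        if golden.get(key) != running.get(key):
            drift[key] = (golden.get(key), running.get(key))
    return drift

golden = {"snmp-community": "private", "ntp-server": "10.0.0.1"}
running = {"snmp-community": "public", "ntp-server": "10.0.0.1"}
drift = config_drift(golden, running)
```

Real device configs are hierarchical text rather than flat dicts, but the principle – diff against a known-good baseline and notify on variance – is the same.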

New Assurance Architecture
But so many aspects of the old world have been changing within our networks and IT systems. The active network, the resilience mechanisms, the level of virtualisation, the release management methods, containerisation, microservices, etc. The list goes on and the demands have become more complex, but also far more dynamic.

Let’s start with the data sources this time, because this impacts our choice of data storage mechanism. We still receive data from the active network devices (and EMS/NMS), but we now also take feeds from many other sources. They might be internal (from IT, security, etc), but could also be external, like social indicators. The 4 Vs of data have fundamentally changed between the old and new models:

  • Volume – we’re seeing far more data
  • Variety – the sources are increasing and the structure of data is no longer as homogenised as it once was (in fact unstructured data is now commonplace)
  • Velocity – we’re receiving incoming data at any number of different velocities, often at far higher frequency than the 15 minute poll cycles of the past
  • Veracity (or trustworthiness) – our systems of old relied on highly dependable data because of its relational nature, and could easily enter a data death spiral if data quality deteriorated. Now we accept data of questionable integrity and need to work around it

Again, the data storage mechanism is at the heart of the solution. In this case it’s (probably) an unstructured data lake rather than a relational database, because of the 4 Vs above. The data must still be stored in a way that allows cross-referencing with other data sets (ie the role of the indexer), but it isn’t as homogenised as in a relational database.
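A toy sketch of that indexer idea: records are kept raw and un-homogenised, and only a few shared attributes are indexed so different data sets can still be cross-referenced. The class, field names and index keys here are all illustrative assumptions, not any particular product’s design.

```python
from collections import defaultdict

class MiniLake:
    """Store raw records as-is; index only a few shared keys so
    heterogeneous data sets can still be cross-referenced."""
    def __init__(self, index_keys=("device", "site")):
        self.records = []
        self.index_keys = index_keys
        self.index = defaultdict(list)

    def ingest(self, record: dict):
        pos = len(self.records)
        self.records.append(record)  # raw, un-homogenised payload
        for key in self.index_keys:
            if key in record:
                self.index[(key, record[key])].append(pos)

    def lookup(self, key, value):
        return [self.records[i] for i in self.index.get((key, value), [])]

lake = MiniLake()
lake.ingest({"device": "rtr-01", "cpu_pct": 97, "source": "snmp"})
lake.ingest({"device": "rtr-01", "msg": "%LINK-3-UPDOWN", "source": "syslog"})
lake.ingest({"site": "melbourne", "sentiment": -0.6, "source": "social"})
matches = lake.lookup("device", "rtr-01")  # cross-references SNMP + syslog
```

The key contrast with the old relational model: ingestion never rejects a record for not fitting a schema; structure is applied lazily, at query time.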

The 4 Vs also fundamentally change the way we have to make use of the data. It surpasses our ability to process it in a manual or semi-manual way (where semi-manual means the traditional rules-based automations like suppression, root-cause analysis, etc). We have no choice but to increase our dependency on machine-driven tools, as automations need to become:

  • More closed-loop in nature – that is, to not just consolidate and create a ticket, but also to automate the resolution and ticket closure
  • More abundant – doing even more of the mundane, recurring tasks like auto-sizing resources (particularly virtual environments), restarting services, scheduling services, log clean-up, etc
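The closed-loop idea – remediate and close, not just consolidate and ticket – can be sketched as a simple pipeline. Everything here is an assumed, hypothetical interface (the `TicketStub`, the `restart` and `check_health` hooks) standing in for a real orchestrator and ticketing API:

```python
class TicketStub:
    """Stand-in for a real ticketing API (hypothetical interface)."""
    def __init__(self):
        self.state = None
    def open(self, alarm):
        self.state = "open"
        return "T-1"
    def close(self, ticket_id, note=""):
        self.state = "closed"
    def escalate(self, ticket_id):
        self.state = "escalated"

def closed_loop(alarm, restart, check_health, ticket):
    """Consolidate -> remediate -> verify -> close, rather than
    stopping at ticket creation like the old model."""
    ticket_id = ticket.open(alarm)
    restart(alarm["service"])            # automated remediation step
    if check_health(alarm["service"]):
        ticket.close(ticket_id, note="auto-resolved")
        return True
    ticket.escalate(ticket_id)           # hand back to a human
    return False

tickets = TicketStub()
resolved = closed_loop({"service": "dhcp", "type": "serviceDown"},
                       restart=lambda s: None,       # would call the orchestrator
                       check_health=lambda s: True,  # would re-poll the service
                       ticket=tickets)
```

The escalation branch matters: a sensible closed loop only closes tickets it has verified as fixed, and falls back to a human otherwise.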

To be honest, we probably passed the manual/semi-manual tipping point many years ago. In the meantime we’ve done the best we could, eagerly waiting for machine tools like ML (Machine Learning) and AI (Artificial Intelligence) to catch up and help out. This is still fairly nascent, but AIOps tools are becoming increasingly prevalent.

The exciting thing is that once we start harnessing the potential of these machine tools, our AIOps should allow us to ask far more than just the network health questions of past models. They could allow us to ask marketing / cost / capacity / profitability / security questions like:

  • Where do I urgently need to increase capacity in the network (and can you automatically just make this happen – a more “just in time” provisioning of resources rather than planning months ahead)?
  • Where could I re-position capacity around the network to reduce costs, improve performance, improve customer experience and meet unmet demand?
  • Where should sales/marketing teams focus their efforts to best service unmet demand (based on the current network, or in sequence with network build-out that’s due to become ready-for-service)?
  • Where are the areas of unmet demand compared with our current network footprint?
  • With an available budget of $x, what ratio of maintenance, replacement and expansion is it best spent on, and where?
  • How do we better understand profitability vectors in the network, rather than just cruder revenue metrics (profitability vectors could include service density, amount of maintenance on the supporting infrastructure, customer interactions, churn, etc on a geographic or similar basis)?
  • Where (and how) can we progressively automate a more intent- or policy-driven auto-remediation of the network (if we don’t already have a consistent approach to config management)?
  • What policies can we tweak to get better performance from the network on a more real-time basis (eg tweaking QoS policies based on current traffic in various parts of the network)?
  • Can we look at past maintenance trends to automatically set routine maintenance schedules that are customised by device, region, device type, load, etc, rather than using a one-size-fits-all maintenance schedule?
  • Can I determine, in real time, which services are using which resources, to get a true service-impact estimate in a dynamic, packet-switched network environment?
  • What configurations (or misconfigurations) in the network pose security vulnerability threats?
  • If a configuration change is identified, can it be automatically audited and reported on (and perhaps even quarantined) before being authorised (manually or automatically)?
  • What anomalies are we seeing that could represent security events?
  • Can we utilise end-to-end constructs such as network services, customer services, product lifecycle, device lifecycle and application performance (as well as traditional network performance) to enhance context and correlation?
  • And so many more that can’t easily be accommodated by traditional assurance tools.
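To ground just one of the questions above – spotting anomalies that could represent security (or health) events – here’s a minimal sketch of a classic baseline technique: flag any point more than a few standard deviations from its trailing window. The window size, threshold and sample series are arbitrary choices for illustration; production AIOps tools use far richer models than this.

```python
from statistics import mean, stdev

def anomalies(series, window=12, threshold=3.0):
    """Flag indices whose value sits more than `threshold` standard
    deviations away from the mean of the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        w = series[i - window:i]
        mu, sigma = mean(w), stdev(w)
        if sigma and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# A steady metric (eg interface utilisation %) with one sudden spike
base = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 12, 10]
series = base * 2 + [95.0]
spikes = anomalies(series)  # flags the spike at index 24
```

Even a crude detector like this illustrates the shift: the baseline is learned from the data itself rather than hard-coded as a fixed threshold by an operator.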
