Is your service assurance really service assurance?? (Part 4)

Yesterday’s post introduced the concept of active measurements as the better method for monitoring and assuring customer services.

Like the rest of this series, it borrowed from an interesting white paper from the Netrounds team titled “Reimagining Service Assurance in the Digital Service Provider Era”.

Interestingly, I also just stumbled upon OpenTelemetry, an open-source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face in getting true end-to-end metrics from our networks (as opposed to the easy-to-come-by nodal metrics).

In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (i.e. end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.
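To make the span / trace distinction a little more concrete, here’s a minimal sketch of rolling per-span measurements up into an end-to-end view of a service chain. It isn’t the OpenTelemetry API. The `Span` class, its fields and the `summarise_traces` helper are all hypothetical names of my own, and it assumes the spans of a chain are traversed one after another, so that their latencies simply add up.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop / segment of a service chain (hypothetical structure)."""
    trace_id: str      # identifies the end-to-end service chain
    name: str          # e.g. "access", "core", "peering"
    latency_ms: float  # latency measured across this span only
    ok: bool           # whether this span is currently healthy


def summarise_traces(spans):
    """Group spans by trace_id and derive an end-to-end view per trace.

    Assumes spans within a trace are sequential, so latencies are summed.
    """
    traces = {}
    for span in spans:
        trace = traces.setdefault(
            span.trace_id, {"latency_ms": 0.0, "ok": True, "hops": []}
        )
        trace["latency_ms"] += span.latency_ms
        # The chain is only healthy if every span in it is healthy.
        trace["ok"] = trace["ok"] and span.ok
        trace["hops"].append(span.name)
    return traces


# Illustrative data only, not real measurements.
example_spans = [
    Span("svc-1", "access", 4.0, True),
    Span("svc-1", "core", 10.0, True),
    Span("svc-1", "peering", 6.5, False),
    Span("svc-2", "access", 3.0, True),
]

for trace_id, summary in summarise_traces(example_spans).items():
    print(trace_id, summary)
```

The point of the sketch is that every node on “svc-1” could be reporting perfectly healthy nodal metrics whilst the chain as a whole is failing. Only the trace-level view makes that visible.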

Table 4 (from the Netrounds white paper linked above) provides a view of the most common AI / ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation” (filter, group, correlate and prioritise alarms); that is, for minimising the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (i.e. network assurance, not service assurance).
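As a rough illustration of what “grouping” alarms means in practice, here’s a deliberately simple sketch. It’s a rule-based stand-in (bucketing alarms by device and fixed time window), not one of the ML clustering techniques the white paper refers to. The tuple layout, the `group_alarms` name and the 60-second window are my own assumptions.

```python
from collections import defaultdict


def group_alarms(alarms, window_s=60):
    """Bucket alarms by (device, time window) to reduce operator noise.

    `alarms` is an iterable of (timestamp_s, device, text) tuples
    (a hypothetical layout). Alarms from the same device that fall in
    the same fixed window are presented as a single group.
    """
    groups = defaultdict(list)
    for timestamp_s, device, text in alarms:
        bucket = int(timestamp_s // window_s)
        groups[(device, bucket)].append(text)
    return dict(groups)


# Illustrative data only.
example_alarms = [
    (0, "router-1", "link down"),
    (10, "router-1", "BGP session down"),
    (70, "router-1", "link up"),
    (5, "router-2", "fan failure"),
]

for key, texts in group_alarms(example_alarms).items():
    print(key, texts)
```

Four alarms become three groups for the operator to look at. Notice though that nothing in the output says whether any customer service was actually affected, which is exactly the gap between network assurance and service assurance.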

These techniques help optimise network / resource problems, but don’t solve the more important service-related problems, as articulated in Table 5 below (again from Netrounds).

If we can directly collect trace data (i.e. the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with the narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:

  • Has the digital service been properly activated?
  • What service level is currently being experienced by customers (and are SLAs being met)?
  • Is there an outage or degradation of end-to-end service chains (established over multi-domain, hybrid and multi-layered networks)?
  • Does feedback need to be applied (e.g. via an orchestration solution) to heal the network?
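The questions above can be sketched as a simple check over a batch of active measurement results. This is only an illustration under my own assumptions, not how Netrounds or any particular product does it. The `Sample` fields, the `assess_service` name and the SLA thresholds (50 ms latency, 1% loss) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One active test result for an end-to-end service (hypothetical)."""
    success: bool      # did the test traffic traverse the service chain?
    latency_ms: float  # measured end-to-end latency
    loss_pct: float    # measured packet loss, in percent


def assess_service(samples, max_latency_ms=50.0, max_loss_pct=1.0):
    """Answer the four questions from a batch of active measurements."""
    good = [s for s in samples if s.success]
    activated = bool(good)  # at least one test got through end to end
    if not activated:
        sla_met, state = False, "outage"
    else:
        avg_latency = sum(s.latency_ms for s in good) / len(good)
        avg_loss = sum(s.loss_pct for s in samples) / len(samples)
        sla_met = avg_latency <= max_latency_ms and avg_loss <= max_loss_pct
        state = "ok" if sla_met else "degraded"
    return {
        "activated": activated,
        "sla_met": sla_met,
        "state": state,
        # A real closed loop would notify an orchestrator here.
        "needs_healing": state != "ok",
    }


# Illustrative data only.
print(assess_service([Sample(True, 20.0, 0.0), Sample(True, 30.0, 0.5)]))
print(assess_service([Sample(True, 80.0, 0.0)]))
print(assess_service([]))
```

Each answer comes straight from measurements of the service itself, so there’s no need to infer the customer’s experience from nodal logs / events.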

PS. Since we spoke about the AI / ML techniques of Classification and Clustering above, you might want to revisit an earlier post that discusses a contrarian approach to root-cause analysis that could use them too – Auto-releasing chaos monkeys to harden your network (CT/IR).
