Is your service assurance really service assurance?? (Part 4)

Yesterday’s post introduced the concept of active measurements as the better method for monitoring and assuring customer services.

Like the rest of this series, it borrowed from an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Interestingly, I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face getting true end-to-end metrics from our networks (as opposed to the easy to come by nodal metrics).

In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.

Table 4 (from the Netrounds white paper link above) provides a view of the most common AI/ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation,” (filter, group, correlate and prioritize alarms). That is, to effectively minimise the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (ie network assurance, not service assurance)

They’re helping optimise network / resource problems, but not solving the more important service-related problems, as articulated in Table 5 below (again from Netrounds).

If we can directly collect trace data (ie the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with our narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:

Has the digital service been properly activated.
What service level is currently being experienced by customers (and are SLAs being met)
Is there an outage or degradation of end to end service chains (established over multi-domain, hybrid and multi-layered networks)
Does feedback need to be applied (eg via an orchestration solution) to heal the network

PS. Since we spoke about the AI / ML techniques of Classification and Clustering above, you might want to revisit an earlier post that discusses a contrarian approach to root-cause analysis that could use them too – Auto-releasing chaos monkeys to harden your network (CT/IR).

September 5, 2019
Ryan

If you found this article useful or valuable, subscribe (in the top-right corner of this page) and share. Let's spread the word and inspire more people to become passionate about OSS. Ryan is Passionate About OSS and has dedicated the last two decades to sharing his passion for OSS with the world. He is a founder, author, blogger, Engineer, connector and inquisitive learner about OSS and managing networks. To find out a little about his back-story and why he's so Passionate About OSS, click on the About Page. To connect with Ryan and the PAOSS team, click on the Contact page.

All Posts