Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer to form a service-level perspective.
Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:
- We need to concept of span like OpenTelemetry does between microservices (in a way, it’s like nearest-neighbour of where each packet is getting pushed).
Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
- We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
- One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
- By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
- How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
- How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods. Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc
As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!Read the Passionate About OSS Blog for more or Subscribe to the Passionate About OSS Blog by Email