We use time-stamping in OSS, but what about geo-stamping?

A slightly left-field thought dawned on me the other day and I’d like to hear your thoughts on it.

We all know that almost all telemetry coming out of our networks is time-stamped. Events, syslogs, metrics, etc. That makes perfect sense because we look for time-based ripple-out effects when trying to diagnose issues.

But therefore does it also make sense to geo-stamp telemetry data too? Just as time-based ripple-out is common, so too are geographic / topological (eg nearest neighbour and/or power source) ripple-out effects.

If you want to present telemetry data as a geo/topo overlay, you currently have to enrich the telemetry data set first. Typically that means identifying the device name that’s generating the data and then doing a query on huge inventory databases to find the location and connectivity that corresponds to that device.

It’s usually not a complex query, but consider how much processing power must go into enriching at the enormous scale of telemetry records.

For stationary devices (eg core routers), it might seem a bit absurd adding a fixed geo-code (which has to be manually entered into the device once) to every telemetry record, but it seems computationally far more efficient than data lookups (please correct me if I’m wrong here!). For devices that move around (eg routers on planes), hopefully they already have GPS sensors to provide geo-stamp data.

What do you think? Am I stating a problem that has already been solved and/or is not worth solving? Or does it have merit?

The Autonomous Network / OSS Clock

In yesterday’s post, we talked about what needs to happen for a network operator to build an autonomous network. Many of the factors extended beyond the direct control of the OSS stack. We also looked at the difference between designing network autonomy for an existing OSS versus a ground-up build of an autonomous network.

We mostly looked at the ground-up build yesterday (at the expense of legacy augmentation).

So let’s take a slightly closer look at legacy automation. Like any legacy situation, you need to first understand current state. I’ve heard colleagues discuss the level of maturity of an existing network operations stack in terms of a single metric.

However, I feel that this might miss some of the nuances of the situation. For example, different activities are likely to be at different levels of maturity. Hence, the attempt at benchmarking the current situation on the OSS or Autonomous Networking clock below.

OSS Autonomy Clock

Sample activities shown in grey boxes to demonstrate the concept (I haven’t invested enough time into what the actual breakdown of activities might be yet).

  • Midnight is no monitoring capability
  • 3AM is Reactive Mode (ie reacting to data presented by the network / systems)
  • 6AM is Predictive Mode (ie using historical learnings to identify future situations)
  • 9AM is Prescriptive / Pre-cognitive Mode (ie using historical learnings, or pre-cognitive capabilities to identify what to do next)
  • Mid-day is Autonomous Networking (ie to close the loop and implement / control actions that respond to current situations automatically)

As always, I’d love to hear your thoughts!

As a network owner….

….I want to make my network so observable, reliable, predictable and repeatable that I don’t need anyone to operate it.

That’s clearly a highly ambitious goal. Probably even unachievable if we say it doesn’t need anyone to run it. But I wonder whether this has to be the starting point we take on behalf of our network operator customers?

If we look at most networks, OSS, BSS, NOC, SOC, etc (I’ll call this whole stack “the black box” in this article), they’ve been designed from the ground up to be human-driven. We’re now looking at ways to automate as many steps of operations as possible.

If we were to instead design the black-box to be machine-driven, how different would it look?

In fact, before we do that, perhaps we have to take two unique perspectives on this question:

  1. Retro-fitting existing black-boxes to increase their autonomy
  2. Designing brand new autonomous black-boxes

I suspect our approaches / architectures will be vastly different.

The first will require a incredibly complex measure, command and control engine to sit over top of the existing black box. It will probably also need to reach into many of the components that make up the black box and exert control over them. This approach has many similarities with what we already do in the OSS world. The only exception would be that we’d need to be a lot more “closed-loop” in our thinking. I should also re-iterate that this is incredibly complex because it inherits an existing “decision tree” of enormous complexity and adds further convolution.

The second approach holds a great deal more promise. However, it will require a vastly different approach on many levels:

  1. We have to take a chainsaw to the decision tree inside the black box. For example:
    • We start by removing as much variability from the network as possible. Think of this like other utilities such as water or power. Our electricity service only has one feed-type for almost all residential and business customers. Yet it still allows us great flexibility in what we plug into it. What if a network operator were to simply offer a “broadband dial-tone” service and end users decide what they overlay on that bit-stream
    • This reduces the “protocol stack” in the network (think of this in terms of the long list of features / tick-boxes on any router’s brochure)
    • As well as reducing network complexity, it drastically reduces the variables an end-user needs to decide from. The operator no longer needs 50 grandfathered, legacy products 
    • This also reduces the decision tree in BSS-related functionality like billing, rating, charging, clearing-house
    • We achieve a (globally?) standardised network services catalog that’s completely independent of vendor offerings
    • We achieve a more standardised set of telemetry data coming from the network
    • In turn, this drives a more standardised and minimal set of service-impact and root-cause analyses
  2. We design data input/output methods and interfaces (to the black box and to any of its constituent components) to have closed-loop immediacy in mind. At the moment we tend to have interfaces that allow us to interrogate the network and push changes into the network separately rather than tasking the network to keep itself within expected operational thresholds
  3. We allow networks to self-regulate and self-heal, not just within a node, but between neighbours without necessarily having to revert to centralised control mechanisms like OSS
  4. All components within the black-box, down to device level, are programmable. [As an aside, we need to consider how to make the physical network more programmable or reconcilable, considering that cables, (most) patch panels, joints, etc don’t have APIs. That’s why the physical network tends to give us the biggest data quality challenges, which ripples out into our ability to automate networks]
  5. End-to-end data flows (ie controls) are to be near-real-time, not constrained by processing lags (eg 15 minute poll cycles, hourly log processing cycles, etc) 
  6. Data minimalism engineering. It’s currently not uncommon for network devices to produce dozens, if not hundreds, of different metrics. Most are never used by operators manually, nor are likely to be used by learning machines. This increases data processing, distribution and storage overheads. If we only produce what is useful, then it should improve data flow times (point 5 above). Therefore learning machines should be able to control which data sets they need from network devices and at what cadence. The learning engine can start off collecting all metrics, then progressively turning them off as they deem metrics unnecessary. This could also extend to controlling log-levels (ie how much granularity of data is generated for a particular log, event, performance counter)
  7. Perhaps we even offer AI-as-a-service, whereby any of the components within the black-box can call upon a centralised AI service (and the common data lake that underpins it) to assist with localised self-healing, self-regulation, etc. This facilitates closed-loop decisions throughout the stack rather than just an over-arching command and control mechanism

I’m barely exposing the tip of the iceberg here. I’d love to get your thoughts on what else it will take to bring fully autonomous network to reality.

Net Simplicity Score (NSS) gets a little more complex

In last Tuesday’s post, I asked the community here on PAOSS and on TM Forum’s Engage platform for ideas about how you would benchmark complexity.

I also provided a reference to an old post that described the concept of a NSS (Net Simplicity Score) for our OSS/BSS.

Due to the complexity of factors that contribute to a complexity score, the NSS is a “catch-all” simplicity metric. Hopefully it will allow subtraction projects to be easily justified, just as the NPS (Net Promoter Score) metric has helped justify customer experience initiatives.

The NSS (Net Simplicity Score), could be further broken down into:

  • The NCSS (Net Customer Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) how easy is it to choose and use the company / product / service? This is an external metric (ie the ranking of the level of difficulty that your customers face)
  • The NOSS (Net Operator Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) how easy is it to choose and use the company / product / service? This is an internal metric (ie for operators to rank complexity of systems and their constituent applications / data / processes)

One interesting item of feedback came from Ronald Hasenberger. He rightly pointed out that just because something is simple for users to interact with, doesn’t mean it’s simple behind the scenes – often exactly the opposite. The iPod example I used in earlier posts is a case in point. The iPod was more intuitive than existing MP3 players, but a huge amount of design and engineering went into making it that way. The underlying “system” certainly wasn’t simple.

So perhaps there’s a third simplicity factor to add to the two bullets listed above:

  • The NSSS (Net System Simplicity Score) – and this one does require a more sophisticated algorithm than just an aggregate of perceptions. Not only that, but it’s the one that truly reflects the systems we design and build. I wonder whether the first two are an initial set of proxies that help drive complexity out of our solutions, but we need to develop Ronald’s third one to make the biggest impact?

Again, I’d love to hear your thoughts!

OSS are not just a #$%&ing cost centre

It seems that OSS/BSS are always an afterthought. And always seen as a cost centre rather than a revenue generator.

Now I’m biased of course, but I think that’s such a narrow view. And we need everyone in our industry to spread the same gospel. 

I like to think of it like this… Sales teams identify the customers and revenue (let’s call them THE BUY-SIDE). Network teams build the assets that service the customer needs (let’s call them THE SELL-SIDE). But the OSS/BSS are the profit engine because they bring buyers and sellers together.

They initiate revenues (Fulfillment / Activation workflows), they retain revenues (Assurance workflows) and they can identify, then minimise costs (automations, analytics, leakage management, identify ineffective work practices, identify simplification opportunities, workforce coordination, etc, etc). They also have a strong influence on customer experience.

OSS/BSS operationalise the assets (to deliver the services that customers pay for).

How much revenue do our OSS/BSS operationalise? All of it (unless some orders are for professional services and/or being activated directly on the network without touching OSS or BSS systems)

The OSS/BSS also provide the strategic levers for management to pull in future. In times when long-term competitive advantages are hard to find, your OSS/BSS can give significant competitive advantage (if flexible and effective) or hinder it (if inflexible / unadaptable)

Opinions wanted – How to Benchmark OSS/BSS complexity

I’d love to ask you an important question…  how do we benchmark OSS/BSS complexity? To measure how complex our systems are and therefore provide a signpost for simplification.

A colleague has opined that the number of apps in a stack could be used a proxy. I can see where he’s going with that, but I feel that it doesn’t account for architectural differences such as monolith versus microservices.

I’d love to hear your thoughts via the comments box below.

FWIW, here are some additional thoughts from me, but please don’t let them bias your opinions:

  • For me, complexity relates to the efficiency of getting tasks done:
    • How much time to complete certain tasks
    • How many button clicks
    • How much swivel-chairing
    • How many CPU cycles for automated tasks
    • How much admin overhead
    • How much duplicated effort and/or rework
  • However, there are so many different tasks done within an OSS/BSS stack that it’s difficult to provide a complexity metric that compares one OSS/BSS stack with another. Or compares a single stack before/after changes are made
  • In some cases the complexity happens inside the OSS/BSS “black box” (eg tools within the suite aren’t seamlessly integrated, causing operators to perform dual-entry that leads to data inconsistency and downstream re-work)
  • In other cases the complexity is inherited from outside the black box (eg product offerings have hundreds of possible variants that are imperceptibly different in the customer’s eyes). I call this The OSS Pyramid of Pain
  • In many cases, the complexity of an OSS/BSS stack is less about the systems and integrations, and more about the complexity of The Decision Tree that spans the stack. The spread of the Decision Tree is impacted by:
    • The OSS/BSS applications
    • Support applications (eg authentication, security, data management, resilience / availability, etc)
    • System interfaces (internal and external)
    • User interfaces
    • Process designs
    • Product definitions
    • Work practices
    • Data models
    • Design rules
    • Network topologies
    • etc, etc
  • The more complex the Decision Tree, the more complex it is to transform our OSS. It loosely aligns with what I call this The Chessboard Analogy
  • The development strategy used also has an impact, be monolithic, best-of-breed, hosted or in-house developed. For example an in-house-developed solution is likely to have less functionality-bloat than a COTS (off-the-shelf) solution. The COTS solution needs to include additional functionality to enable it to support requirements of multiple customers
  • And finally, a benchmark is only as useful as the actions that it triggers. How do we codify a complexity metric that has the equally complex array of contributions described above?
  • Perhaps we could take a somewhat abstracted approach like the NPS (Net Promoter Score) does, thus creating a NSS (Net Simplicity Score)

As mentioned above, I’d love to hear your thoughts on how we can benchmark the level of complexity in our OSS/BSS. Please leave your comments below.

 

What will get your CEO fired? (part 4)

In Monday’s article, we suggested that the three technical factors that could get the big boss fired are probably only limited to:

  1. Repeated and/or catastrophic failure (of network, systems, etc)
  2. Inability to serve the market (eg offerings, capacity, etc)
  3. Inability to operate network assets profitably

In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it. In yesterday’s article we looked at market-serving factors for us to investigate and monitor.

But let’s look at point 3 today. The profitability factors we could consider that reduce the chances of the big boss getting fired are:

  1. Ability to see revenues in near-real-time (revenues are relatively easy to collect, so we use these numbers a lot. Much harder are profitability measures because of the shared allocation of fixed costs)

  2. Ability to see cost breakdown (particularly which parts of the technical solution are most costly, such as what device types / topologies are failing most often)

  3. Ability to measure profitability by product type, customer, etc

  4. Are there more profitable or cost-effective solutions available

  5. Is there greater profitability that could be unlocked by simplification

What will get your CEO fired? (part 3)

In Monday’s article, we suggested that the three technical factors that could get the big boss fired are probably only limited to:

  1. Repeated and/or catastrophic failure (of network, systems, etc)
  2. Inability to serve the market (eg offerings, capacity, etc)
  3. Inability to operate network assets profitably

In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it. In yesterday’s article we looked at the broader set of catastrophic failure factors for us to investigate and monitor.

But let’s look at some of the broader examples under point 2 today. The market-serving factors we could consider that reduce the chances of the big boss getting fired are:

  1. Immediate visibility of key metrics by boss and execs (what are the metrics that matter, eg customer numbers, ARPU, churn, regulatory, media hot-buttons, network health, etc)

  2. Response to “voice of customer” (including customer feedback, public perception, etc)

  3. Human resources (incl up-skill for new tech, etc)

  4. Ability to implement quickly / efficiently

  5. Ability to handle change (to network topology, devices/vendors, business products, systems, etc)

  6. Measuring end-to-end user experience, not just “nodal” monitoring

  7. Scalability / Capacity (ability to serve customer demand now and into a foreseeable future)

What will get your CEO fired? (part 2)

In Monday’s article, we suggested that the three technical factors that could get the big boss fired are probably only limited to:

  1. Repeated and/or catastrophic failure (of network, systems, etc)
  2. Inability to serve the market (eg offerings, capacity, etc)
  3. Inability to operate network assets profitably

In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it.

But let’s look at some of the broader examples under point 1 today. The failure factors we could consider that might result in the big boss getting fired are:

  1. Availability (nodal and E2E)

  2. Performance (nodal and E2E)

  3. Security (security trust model – cloud vs corporate vs active network and related zones)

  4. Remediation times, systems & processes (Assurance), particularly effectiveness of process for handling P1 (Priority 1) incidents

  5. Resilience Architecture

  6. Disaster Recovery Plan (incl Backup and Restore process, what black-swan events the organisation is susceptible to, etc)

  7. Supportability and Maintenance Routines

  8. Change and Release Management approaches

  9. Human resources (incl business continuity risk of losing IP, etc)

  10. Where are the SPoFs (Single Points of Failure)

We should note too that these should be viewed through two lenses:

  • The lens of the network our OSS/BSS is managing and
  • The lens of the systems (hardware/software/cloud) that make up our OSS/BSS

Interesting metrics from The Blue Book launch

When I first started the Passionate About OSS site / blog many years ago, I was lucky to get a handful of views per day. It’s grown by many multiples since then, fortunately.

The launch of The Blue Book OSS/BSS Vendor Directory generated some exciting metrics yesterday. The directory alone came within 5 pageviews of the highest count we’ve ever seen on PassionateAboutOSS.com (and PAOSS is up to nearly 2,500 posts now). That total appeared in only a 14-hour window because we didn’t go live with The Directory or metric collection until ~10am local time! The graphs are indicating that we should easily exceed PAOSS’s best ever count today.

If you were one of the many viewers who popped in from all around the world to look at The Directory, thank you! If you have any suggested improvements, we’d love to hear from you as we’re sure to be making many further tweaks in coming days/months.

But the most interesting fact about the launch yesterday was that a job posting appeared on UpWork to scrape all the data we’ve presented. On our very first day!! In fact a gentleman in the US reviewed bids and awarded the UpWork job all within about 14 hours of go-live.

That’s positive news because it means that at least one person must’ve thought the data was useful. 🙂

Moving from traditional assurance to AIOps, what are the differences?

We’re going to look into assurance models of the past versus the changing assurance demands that are appearing these days. The diagrams below are highly stylised for discussion purposes so they’re unlikely to reflect actual implementations, but we’ll get to that.

Old Assurance Architecture
Under the old model, the heart of the OSS/BSS was the database (almost exclusively a relational database). It would gather data, via probes/MDDs/collectors, from the network under management (Note: I’ve shown the sources as devices, but they could equally be EMS/NMS). The mediation device drivers (MDDs) would take feeds from the network and homogenise them to be suitable for very precise loading into tables in the relational databases.

This data came in the form of alarms/events, performance counters, syslogs and not much else. It could come in all sorts of common (eg SNMP) or obscure forms / protocols. Some would come via near-real-time notifications, but a lot was polled at cycles such as 5 or 15 mins.

Then the OSS/BSS applications (eg Assurance, Inventory, etc) would consume data from the database and write other data back to the database. There were some automations, such as hard-coded suppression rules, etc.

The automations were rarely closed-loop (ie to actually resolve the assurance challenge). There were also software assistants such as trendlines and threshold alerts to help capacity planners.

There was little overlap into security assurance – occasionally there might have even been a notification of device configuration varying from a golden config or indirect indicators through performance graphs / thresholds.

New Assurance Architecture
But so many aspects of the old world have been changing within our networks and IT systems. The active network, the resilience mechanisms, the level of virtualisation, the release management methods, containerisation, microservices, etc. The list goes on and the demands have become more complex, but also far more dynamic.

Let’s start with the data sources this time, because this impacts our choice of data storage mechanism. We still receive data from the active network devices (and EMS/NMS), but we also now source data from other sources. They might be internal sources from IT, security, etc, but could also be external sources like social indicators. The 4 Vs of data between old and new models have fundamentally changed:

  • Volume – we’re seeing far more data
  • Variety – the sources are increasing and the structure of data is no longer as homogenised as it once was (in fact unstructured data is now commonplace)
  • Velocity – we’re receiving incoming data at any number of different velocities, often at far higher frequency than the 15 minute poll cycles of the past
  • Veracity (or trustworthiness) – our systems of old relied on highly dependable data due to its relational nature and could easily become a data death spiral if data quality deteriorated. Now we accept data with questionable integrity and need to work around it

Again the data storage mechanism is at the heart of the solution. In this case it’s a (probably) unstructured data lake rather than a relational database because of the 4 Vs above. The data that it stores must still be stored in a way that allows cross-referencing to happen with other data sets (ie the role of the indexer), but not as homogenised as a relational database.

The 4 Vs also fundamentally change the way we have to make use of the data. It surpasses our ability to process in a manual or semi-manual way (where semi-manual implies the traditional rules-based automations like suppression, root-cause analysis, etc). We have no choice but to increase dependency on machine-driven tools as automations need to become:

  • More closed-loop in nature – that is, to not just consolidate and create a ticket, but also to automate the resolution and ticket closure
  • More abundant – doing even more of the mundane, recurring tasks like auto-sizing resources (particularly virtual environments), restarting services, scheduling services, log clean-up, etc

To be honest, we probably passed the manual/semi-manual tipping point many years ago. In the meantime we’ve done as best we could, eagerly waiting until the machine tools like ML (Machine Learning) and AI (Artificial Intelligence) could catch up and help out. This is still fairly nascent, but AIOps tools are becoming increasingly prevalent.

The exciting thing is that once we start harnessing the potential of these machine tools, our AIOps should allow us to ask far more than just network health questions like past models. They could allow us to ask marketing / cost / capacity / profitability / security questions like:

  • Where do I urgently need to increase capacity in the network (and can you automatically just make this happen – a more “just in time” provisioning of resources rather than planning months ahead)
  • Where could I re-position capacity around the network to reduce costs, improve performance, improve customer experience, meet unmet demand
  • Where should sales/marketing teams focus their efforts to best service unmet demand (could be based on current network or in sequence with network build-out that’s due to become ready-for-service)
  • Where are the areas of un-met demand compared with our current network footprint
  • With an available budget of $x, is it best spent on which ratio of maintenance, replacement, expansion and where
  • How do we better understand profitability vectors in the network compared to just the more crude revenue metrics (note that profitability vectors could include service density, amount of maintenance on the supporting infrastructure, customer interactions, churn, etc on a geographic or similar basis)
  • Where (and how) can we progressively automate a more intent or policy-driven auto-remediation of the network (if we don’t already have a consistent approach to config management)
  • What policies can we tweak to get better performance from the network on a more real-time basis (eg tweaking QoS policies based on current traffic in various parts of the network)
  • Can we look at past maintenance trends to automatically set routine maintenance schedules that are customised by device, region, device type, loads, etc rather than using a one-size-fits-all maintenance schedule
  • Can I determine, on a real-time basis, what services are using which resources to get a true service impact estimate in a dynamic, packet-switched network environment
  • What configurations (or misconfigurations) in the network pose security vulnerability threats
  • If a configuration change is identified, can it be automatically audited and reported on (and perhaps even quarantined) before then being authorised (manually or automatically?)
  • What anomalies are we seeing that could represent security events
  • Can we utilise end-to-end constructs such as network services, customer services, product lifecycle, device lifecycle, application performance (as well as the traditional network performance) to enhance context and correlation
  • And so many more that can’t be as easily accommodated by traditional assurance tools

Is your service assurance really service assurance?? (Part 6)

Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer to form a service-level perspective.

Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:

  1. We need to concept of span like OpenTelemetry does between microservices (in a way, it’s like nearest-neighbour of where each packet is getting pushed).
    Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
  2. We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
  3. One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
  4. By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
  5. How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
  6. How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods. Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc

As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!

Is your service assurance really service assurance?? (Part 5)

In yesterday’s fourth part of this series about modern network service assurance, we wrote this:

I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because just as you have the concept of traces / metrics / logs for apps, you similarly have traces / metrics / logs for networks.

In the network world, we’re good at getting metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains) as described earlier in this blog series. And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering logs / events, which seems a bit backward to me.

Take a closer look at the OpenTelemetry link above, which provides an overview of how their team is going to gather application telemetry. With increasing software-ification of our networks (eg SDN / NFV) and the use of microservices / NaaS / APIs in our management stacks, could this actually be our path to the holy grail of service assurance (ie capturing trace data – network service telemetry)?? Is it data plane? Is it control / management plane? Is it something in between?

Note: The “active measurements” approach described in part 3 is slightly compromised in current form, which is why I’m so intrigued by the potential of extending the concepts of OpenTelemetry into our software / virtual networks.

I’d really love your take on this one because I’m sure there are many elements to this that I haven’t thought through yet. Please leave your thoughts on the viability of the approach.

Is your service assurance really service assurance?? (Part 4)

Yesterday’s post introduced the concept of active measurements as the better method for monitoring and assuring customer services.

Like the rest of this series, it borrowed from an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Interestingly, I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face getting true end-to-end metrics from our networks (as opposed to the easy to come by nodal metrics).

In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.

Table 4 (from the Netrounds white paper link above) provides a view of the most common AI/ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation,” (filter, group, correlate and prioritize alarms). That is, to effectively minimise the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (ie network assurance, not service assurance)

They’re helping optimise network / resource problems, but not solving the more important service-related problems, as articulated in Table 5 below (again from Netrounds).

If we can directly collect trace data (ie the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with our narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:

  • Has the digital service been properly activated.
  • What service level is currently being experienced by customers (and are SLAs being met)
  • Is there an outage or degradation of end to end service chains (established over multi-domain, hybrid and multi-layered networks)
  • Does feedback need to be applied (eg via an orchestration solution) to heal the network

PS. Since we spoke about the AI / ML techniques of Classification and Clustering above, you might want to revisit an earlier post that discusses a contrarian approach to root-cause analysis that could use them too – Auto-releasing chaos monkeys to harden your network (CT/IR).

Is your service assurance really service assurance?? (Part 3)

Yep, this is the third part, so that might suggest that there were two lead-up articles prior to this one. Well, you’d be right:

  • The first proposed that most of what we refer to as “service assurance” is really only “network infrastructure” assurance.
  • The second then looked at the constraints we face in trying to reverse-engineer “network infrastructure” assurance into data that will allow us to assure customer services.

I should also point out that both posts, like today’s, were inspired by an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Today we’ll discuss the approach/es to overcome the constraints described in yesterday’s post.

As shown via the inserted blue row in Table 6 below (source: Netrounds), a proposed solution is to use active measurements that reflect the end-to-end user experience.

The blue row only talks about the real-time monitoring of “synthetic user traffic,” in the table below. However, there are at least two other active measurement techniques that I can think of:

  • We can monitor real user traffic if we use port-mirroring techniques
  • We can also apply techniques such as TR-069 to collect real-time customer service meta data

Note: There are strengths and weaknesses of each of the three approaches, but we won’t dive into that here. Maybe another time.

You may recall in yesterday’s post that we couldn’t readily ask service-related questions of our traditional systems or data. Excitingly though, active measurement solutions do allow us to ask more customer-centric questions, like those shown in the orange box below. We can start to collect metrics that do relate directly to what the customer is paying for (eg real data throughput rates on a storage backup service). We can start to monitor real SLA metrics, not just proxy / vanity metrics (like device up-time).

Interestingly, I’ve only had the opportunity to use one vendor’s active measurement solutions so far (one synthetic transaction tool and one port-mirror tool). [The vendor is not Netrounds’ I should add. I haven’t seen Netround’s solution yet, just their insightful white paper]. Figure 3 actually does a great job of articulating why the other vendor’s UI (user interface) and APIs are currently lacking.

Whilst they do collect active metrics, the UI doesn’t allow the user to easily ask important service health questions of the data like in the orange box. Instead, the user has to dig around in all the metrics and make their own inferences. Similarly the APIs don’t allow for the identification of events (eg threshold crossing) or automatic push of notifications to external systems.

This leaves a gap in our ability to apply self-healing (automated resolution) and resolution prior to failure (prediction) algorithms like discussed in yesterday’s post. Excitingly, it can collect service-centric data. It just can’t close the loop with it yet!

More on the data tomorrow!

Is your service assurance really service assurance?? (Part 2)

In yesterday’s article, we asked whether what many know as service assurance can rightfully be called service assurance. Yesterday’s, like today’s, post was inspired by an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Below are three insightful tables from the Netrounds white paper:

Table 1 looks at the typical components (systems) that service assurance is comprised of. But more interestingly, it looks at the types of questions / challenges each traditional system is designed to resolve. You’ll have noticed that none of them directly answer any service quality questions (except perhaps inventory systems, which can be prone to having sketchy associations between services and the resources they utilise).

Table 2 takes a more data-centric approach. This becomes important when we look at the big picture here – ensuring reliable and effective delivery of customer services. Infrastructure failures are a fact of life, so improved service assurance models of the future will depend on automated and predictive methods… which rely on algorithms that need data. Again, we notice an absence of service-related data sets here (apart from Inventory again). You can see the constraints of the traditional data collection approach can’t you?

Table 3 instead looks at the goals of an ideal service-centric assurance solution. The traditional systems / data are convenient but clearly don’t align well to those goals. They’re constrained by what has been presented in tables 1 and 2. Even the highly touted panaceas of AI and ML are likely to struggle against those constraints.

What if we instead start with Table 3’s assurance of customer services in mind and work our way back? Or even more precisely, what if we start with an objective of perfect availability and performance of every customer service?

That might imply self-healing (automated resolution) and resolution prior to failure (prediction) as well as resilience. But let’s first give our algorithms (and dare I say it, AI/ML techniques) a better chance of success.

Working back – What must the data look like? What should the systems look like? What questions should these new systems be answering?

More tomorrow.

Is your service assurance really service assurance??

I just came across an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.” You can find a copy here. It’s well worth a read, so much so that I’ll unpack a few of the concepts it contains in a series of articles this week.

It rightly points out that, “Alarms and fault management are what most people think of when hearing the term service assurance. Classical service assurance systems do fall into this category, as they collect indicators from network devices (such as traps, syslog messages and telemetry data) and try to pinpoint faulty devices and interfaces that need fixing.

This takes us into the rabbit-hole of what exactly is a service (a rabbit-hole that this article partly covers). But let’s put that aside for a moment and consider a service as being an end-to-end “thing” that a customer uses (and pays for, and therefore assumes will behave as “they” expect).

To borrow again from Netrounds, “… we must be able to measure and report on service KPIs in order to accurately measure network service quality from the end user, or customer, perspective. The KPIs should correspond to the service that the customer is paying for. For example, internet access services should measure network KPIs like loss, latency, jitter, and DNS and HTTP response times; a storage backup service should measure data throughput rate; IPTV should measure video frame loss, video buffer underrun events and channel zapping time; and VoIP should measure Mean Opinion Score (MOS).”

There’s just one problem with traditional assurance measuring techniques (eg traps, syslog messages). They are only an indirect proxy for the customer’s experience (and expectations) with the service they’re paying for. Traditional techniques just report on the links in the chain rather than the integrity of the entire length of chain. We have to look at each broken link and attempt to determine whether the chain’s integrity is actually impaired (considering the “meshing” that protects modern service chains). And if there is impairment, to then determine whose chain is impacted, in what way, and what priority needs to be given to its repair.

If we’re being completely honest, the customer doesn’t care about the chain links, or even their MOS score, only that they couldn’t understand what the person at the other end of the VoIP line was trying to communicate with them.

Exacerbating this further, with increasing dependency on cloud and virtualised resources means that there are more chain links that fall outside our domain of visibility.

So, this thing that we’ve called service assurance for the last few decades might actually be a misnomer. We’ve definitely been monitoring the health of network devices and infrastructure (the links), but we tend to only be able to manage services (the chain) through reverse-engineering – by inference, brute force and wizardry.

Is there another way? Let’s dig further in tomorrow’s post.

OSS Persona 10:10:10 Mapping

We sometimes attack OSS/BSS planning at a quite transactional level. For example, think about the process of gathering detailed requirements at the start of a project. They tend to be detailed and transactional don’t they? This type of requirement gathering is more like the WHAT and HOW rings in Simon Sinek’s Golden Circle.

Just curious, do you have a persona map that shows all of the different user groups that interact with your OSS/BSS?
More importantly, do you deeply understand WHY they interact with your OSS/BSS? Not just on a transaction-by-transaction level, but in the deeper context of how the organisation functions? Perhaps even on a psychological level?

If you do, you’re in a great position to apply the 10:10:10 mapping rule. That is, to describe how you’re adding value to each user group 10 minutes from now, 10 days from now and 10 months from now…

OSS Persona 10:10:10 Mapping

The mapping table could describe current tense (ie how your OSS/BSS is currently adding value), or as a planning mechanism for a future tense (ie how your OSS/BSS can add value in the future).
This mapping table can act as a guide for the evolution of your solution.

I should also point out that the diagram above only shows a sample of the internal personas that directly interact with your OSS/BSS. But I’d encourage you to look further. There are other personas that have direct and indirect engagement with your OSS/BSS. These include internal stakeholders like project sponsors, executives, data consumers, etc. They also include external stakeholders such as end-customers, regulatory bodies, etc.

If you need assistance to unlock your current state through persona mapping, real process mapping, etc and then planning out your target-state, Passionate About OSS would be delighted to help.

OSS user heat-mapping

Over the many OSS implementation projects I’ve worked on, UI/UX (user interface / user experience) has been an afterthought (if even thought about at all). I know there are OSS UI/UX experts out there (I’ve met a handful), but none have ever been assigned to the projects I’ve worked on unfortunately. UI has always just been the domain of the developer. If the functionality worked (even if in a highly convoluted way), then the developer would move on to the next epic. The UI was almost never re-visited unless the functionality proved to be almost unusable.

So the question becomes, how do we observe, measure and trial UI/UX effectiveness?

Have you ever tried running a heat-mapping analysis over your OSS to show actual user behaviour?

Given that almost all OSS are now browser-based, there are plenty of heat-map tools available. They give results that might look something like this (but can also provide more granularity of analysis too):
Heat-map
Image source: https://www.tatvic.com/data-analytics-solutions/heat-map-integration/

Whereas these tools are generally used by web developers to improve retention and conversion rates on their pages (ie customers buying, clicking through on a banner ad, calls to action, etc), we’ll use them in a different way within our OSS. We’ll instead be looking for efficiency of movement, an indicator of whether the design of our page is intuitive or not. Are the operators of your OSS clicking in the right places (menus, titles, buttons, links, etc) to achieve certain outcomes?

I’d be particularly interested in comparing heat-maps of new operators (eg if you’ve installed a sand-pit environment at a client site for the first time and let the client’s operators loose) versus experienced teams. Depending on the OSS application you’re analysing, you may even been interested in observing different behaviours across different devices (eg desktops, phones, tablets).

There’s generally a LOT of functionality available within each OSS. Are we optimising the experience for the functionality that matters most? For web-page designers, that might mean ensuring all the most important information is “above-the-fold” (ie can be seen on the page without having to scroll down – noting that the “fold” will be different across different devices/resolutions). If they want a user to click on the “buy now” button, then they *may* want that to appear above the fold, ensuring the prospective buyer doesn’t have to go searching for it.

In the case of an OSS, you don’t want to hide the most important functionality under layers of menus. And don’t forget that different personas (eg designers, admins, execs, help-desk, NOC, etc) are likely to have different definitions of “important functionality.” You may want to optimise important UI elements for each different persona (if your OSS allows for that level of configurability).

I’m not endorsing Smartlook, but if you’d like to read more about heat-mapping techniques, you can start here.

A modern twist on OSS architecture

I was speaking with a friend today about an old OSS assurance product that is undergoing a refresh and investment after years of stagnation.

He indicated that it was to come with about 20 out of the box adaptors for data collection. I found that interesting because it was replacing a product that probably had in excess of 100 adaptors. Seemed like a major backward step… until my friend pointed out the types of adaptor in this new product iteration – Splunk, AWS, etc.

Of course!!

Our OSS no longer collect data directly from the network anymore. We have web-scaled processes sucking everything out of the network / EMS, aggregating it and transforming / indexing / storing it. Then, like any other IT application, our OSS just collect what we need from a data set that has already been consolidated and homogenised.

I don’t know why I’d never thought about it like this before (ie building an architecture that doesn’t even consider connecting to the the multitude of network / device / EMS types). In doing so, we lose the direct connection to the source, but we also reduce our integration tax load (directly to the OSS at least).