Network Service Assurance has new meaning

Back in the old days, Network Service Assurance probably had a different meaning than it might today.

Clearly it’s assurance of a network service. That’s fairly obvious. But it’s in the definition of “network service” where the old and new terminologies have the potential to diverge.

In years past, telco networks were “nailed up” and network functions were physical appliances. I would’ve implied (probably incorrectly, but bear with me) that a “network service” was “owned” by the carrier and was something like a bearer circuit  (as distinct from a customer service or customer circuit). Those bearer circuits, using protocols such as in DWDM, SDH, SONET, ATM, etc potentially carried lots of customer circuits so they were definitely worth assuring. And in those nailed-up networks, we knew exactly which network appliances / resources / bearers were being utilised. This simplified service impact analysis (SIA) and allowed targeted fault-fix.

In those networks the OSS/BSS was generally able to establish a clear line of association from customer service to physical resources as per the TMN pyramid below. Yes, some abstraction happened as information permeated up the stack, but awareness of connectivity and resource utilisation was generally retained end-to-end (E2E).
OSS abstract and connect

But in the more modern computer or virtualised network, it all goes a bit haywire, perhaps starting right back at the definition of a network service.

The modern “network service” is more aligned to ETSI’s NFV definition – “a composition of network functions and defined by its functional and behavioral specification. The Network Service contributes to the behaviour of the higher layer service, which is characterised by at least performance, dependability, and security specifications. The end-to-end network service behaviour is the result of a combination of the individual network function behaviours as well as the behaviours of the network infrastructure composition mechanism.”

They are applications running at OSI’s application layer that can be consumed by other applications. These network services include DNS, DHCP, VoIP, etc, but the concept of NaaS (Network as a Service) expands the possibilities further.

So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the physical layer, other than to say the customer services consume from a pool of resources (the yellow cloud below). Assurance becomes more disconnected as a result.

BSS OSS cloud abstract

OSS/BSS are able to tie customer services to pools of resources (the yellow cloud). And OSS/BSS tools also include PNI / WFM (Physical Network Inventory / Workforce Management) to manage the bottom, physical layer. But now there’s potentially an opaque gulf in the middle where virtualisation / NaaS exists.

The end-to-end association between customer services and the physical resources that carry them is lost. Unless we can find a way to establish E2E association, we just have to hope that our modern Network Service Assurance (NSA) tools make the yellow cloud robust to the point of infallibility. BTW. If the yellow cloud includes NaaS, then the NSA has to assure the NaaS gateway, catalog and all services instantiated through the gateway.

But as we know, there will always be failures in physical infrastructure (cable cuts, electronic malfunctions, etc). The individual resources can’t afford to be infallible, even if the resource pool seeks to provide collective resiliency.

Modern NSA has to find a way to manage the resource pool but also coordinate fault-fix in the physical resources that underpin it like the OSS used to do (still do??). They have to do more than just build policies and actions to ensure SLAs don’t they? They can seek to manage security, power, performance, utilisation and more. Unfortunately, not everything can be fixed programmatically, although that is a great place for NSA to start.

Perhaps if the NSA was just assuring the yellow cloud, any time it identifies any physical degredation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).

I’d love to hear your thoughts. Hopefully you can even correct me if/where I’m wrong.

Auto-releasing chaos monkeys to harden your network (CT/IR)

In earlier posts, we’ve talked about using Netflix’s chaos monkey approach as a way of getting to Zero Touch Assurance (ZTA). The chaos monkeys intentionally trigger faults in the network as a means of ensuring resilience. Not just for known degradation / outage events, but to unknown events too.

I’d like to introduce the concept of CT/IR – Continual Test / Incremental Resilience. Analogous to CI/CD (Continuous Integration / Continuous Delivery) before it, CT/IR is a method to systematically and programmatically test the resilience of the network, then ensuring resilience is continually improving.

The continual, incremental improvement in resiliency potentially comes via multiple feedback loops:

  1. Ideally, the existing resilience mechanisms work around or overcome any degradation or failure in the network
  2. The continual triggering of faults into the network will provide additional seed data for AI/ML tools to learn from and improve upon, especially root-cause analysis (noting that in the case of CT/IR, the root-cause is certain – we KNOW the cause – because we triggered it – rather than reverse engineering what the cause may have been)
  3. We can program the network to overcome the problem (eg turn up extra capacity, re-engineer traffic flows, change configurations, etc). Having the NaaS that we spoke about yesterday, provides greater programmability for the network by the way.
  4. We can implement systematic programs / projects to fix endemic faults or weak spots in the network *
  5. Perform regression tests to constantly stress-test the network as it evolves through network augmentation, new device types, etc

Now, you may argue that no carrier in their right mind will allow intentional faults to be triggered. So that’s where we unleash the chaos monkeys on our digital twin technology and/or PSUP (Production Support) environments at first. Then on our prod network if we develop enough trust in it.

I live in Australia, which suffers from severe bushfires every summer. Our fire-fighters spend a lot of time back-burning during the cooler months to reduce flammable material and therefore the severity of summer fires. Occasionally the back-burns get out of control, causing problems. But they’re still done for the greater good. The same principle could apply to unleashing chaos monkeys on a production network… once you’re confident in your ability to control the problems that might follow.

* When I say network, I’m also referring to the physical and logical network, but also support functions such as EMS (Element Management Systems), NCM (Network Configuration Management tools), backup/restore mechanisms, service order replay processes in the event of an outage, OSS/BSS, NaaS, etc.

Where does BSS end and OSS begin?

Over the years, I’ve been asked the question many times, “what’s the difference between OSS (Operational Support Systems) and BSS (Business Support Systems)?” I’ve also been asked, albeit slightly less regularly, how OSS and BSS map to TM Forum standards like the TAM and eTOM.

To my knowledge, TM Forum has never attempted to map OSS vs BSS. It sets off too many religious wars.

Just for fun, I thought I’d have a crack at trying to map OSS and BSS onto the TAM. Click on the image for a larger PDF version.

OSS and BSS overlaid onto the TAM

I’ve taken the perspective that customer or business-facing functionality is generally considered to be BSS. Alternatively, network / operations-facing functionality is generally considered to be OSS.
And these two tend to overlap at the service layer.

Or, you could just simply call them business operations systems (BOS) that cover the entire TAM estate.

What do you think? Does it trigger a religious war for you? Comments welcomed below.

FWIW. I come from an era when my “OSS” tools had a lot of functionality that could arguably be classified as BSS-centric (eg product management, customer relationship management, service order entry, etc). They also happened to deliver functionality that others might classify as NMS or EMS (Network Management System or Element Management System) in nature. In my mind, they’ve always just been software that supports operationalisation of a network, whether customer or network/resource-facing. It’s one of the reasons this site is called Passionate About OSS, not Passionate About OSS/BSS/NMS/EMS.

Is your OSS squeaking like an un-oiled bearing?

Network operators spend huge amounts on building and maintaining their OSS/BSS every year. There are many reasons they invest so heavily, but in most cases it can be distilled back to one thing – improving operational efficiency.

And our OSS/BSS definitely do improve operational efficiency, but there are still so many sources of friction. They’re squeaking like un-oiled bearings. Here are just a few of the common sources:

  1. First-time Installation
  2. Identifying best-fit tools
  3. Procurement of new tools
  4. Update / release processes
  5. Continuous data quality / consistency improvement
  6. Navigating to all features through the user interface
  7. Non-intuitive functionality / processes
  8. So many variants / complexity that end-users take years to attain expert-level capability
  9. Integration / interconnect
  10. Getting new starters up to speed
  11. Getting proficient operators to expertise
  12. Unlocking actionable insights from huge data piles
  13. Resolving the root-cause of complex faults
  14. Onboarding new customers
  15. Productionising new functionality
  16. Exception and fallout handling
  17. Access to supplier expertise to resolve challenges

The list goes on far deeper than that list too. The challenge for many OSS product teams, for any number of reasons, is that their focus is on adding new features rather than reducing friction in what already exists.

The challenge for product teams is diagnosing where the friction  and risks are for their customers / stakeholders. How do you get that feedback?

  • Every vendor has a product support team, so that’s a useful place to start, both in terms of what’s generating the most support calls and in terms of first-hand feedback from customers
  • Do you hold user forums on a regular basis, where you get many of your customers together to discuss their challenges, your future roadmap, new improvements / features
  • Does your process “flow” data show where the sticking points are for operators
  • Do you conduct gemba walks with your customers
  • Do you have a program of ensuring all developers spend at least a few days a year interacting directly with customers on their site/s
  • Do you observe areas of difficulty when delivering training
  • Do you go out of your way to ask your customers / stakeholders questions that are framed around their pain-points, not just framed within the context of your existing OSS
  • Do you conduct customer surveys? More importantly, do you conduct surveys through an independent third-party?

On the last dot-point, I’ve been surprised at some of the profound insights end-users have shared with me when I’ve been conducting these reviews as the independent interviewer. I’ve tended to find answers are more open / honest when being delivered to an independent third-party than if the supplier asks directly. If you’d like assistance running a third-party review, leave us a note on the contact page. We’d be delighted to assist.

Fast and slow OSS, where uCPE and network virtualisation fits in

Yesterday’s post talked about one of the many dichotomies in OSSfast and slow data / processes.

One of the longer lead-time items in relation to OSS data and processes is in network build and customer connections. From the time when capacity planning or a customer order creates the signal to build, it can be many weeks or months before the physical infrastructure work is complete and appearing in the OSS.

There are two financial downsides to this. Firstly, it tends to be CAPEX-heavy with equipment, construction, truck-rolls, government approvals, etc burning through money. Meanwhile, it’s also a period where there is no money coming in because the services aren’t turned on yet. The time-to-cash cycle of new build (or augmentation) is the bane of all telcos.

This is one of the exciting aspects of network virtualisation for telcos. In a time where connectivity is nearly ubiquitous in most countries, often with high-speed broadband access, physical build becomes less essential (except over-builds). Technologies such as uCPE (Universal Customer Premises Equipment), NFV (Network Function Virtualisation), SD WAN (Software-Defined Wide Area Networks), SDN (Software Defined Networks) and others mean that we can remotely upgrade and reconfigure the network without field work.

Network virtualisation gives the potential to speed up many of the slowest, and costliest processes that run through our OSS… but only if our OSS can support efficient orchestration of virtualised networks. And that means having an OSS with the flexibility to easily change out slow processes to replace them with fast ones without massive overhauls.

Give me a fast OSS and I might ask you to slooooow doooown

The traditional telco (and OSS) ran at different speeds. Some tasks had to happen immediately (eg customers calling one another) while others took time (eg getting a connection to a customer’s home, which included designs, approvals, builds, etc), often weeks.

Our OSS have processes that must happen sequentially and expediently. They also have processes that must wait for dependencies, conditional events and time delays. Some roles need “fast,” others can cope with “slow.” Who wins out in this dilemma?

Even the data we rely on can transact at different speeds. For capacity planning, we’re generally interested in longer-term data. We don’t have to process at real-time. Therefore we can choose to batch process at longer cycle times and with summarised data sets. For network assurance, we’re generally interested in getting data as quick as is viable.

Today’s post is about that word, viable, and pragmatism we sometimes have to apply to our OSS.

For example, if our operations teams want to reduce network performance poll cycles from every 15 mins down to once a minute, we increase the amount of data to process by 15x. That means our data storage costs go up by 15x (assuming a flat-rate cost structure applies). The other hidden cost is that our compute and network costs also go up because we have to transfer and process 15x as much data.

The trade-off we have to make in responses to this rapid escalation of cost (when going from 15 to 1 min) is in the benefits we might derive. Can we avoid SLA (Service Level Agreement) breach costs? Can we avoid costly outages? Can we avoid damage to equipment? Can we reduce the risk of losing our carrier license?

The other question is whether our operators actually have the ability to respond to 15x as much data. Do we have enough people to respond at an increased cycle time? Do we have OSS tools that are capable of filtering what’s important and disregarding “background” activity? Do we have OSS tools that are capable of learning from every single metric (eg AI), at volumes the human brain could never cope with?

Does it make sense that we have a single platform for handling fast and slow processes? For example, do we use the same platform to process 1 minute-cycle performance data for long-term planning (batch-processed once daily) and quick-fire assurance (processed as fast as possible)?

If we stick to one platform, can our OSS apply data reduction techniques (eg selective discard of records) to get the benefits of speed, but with the cost reduction of slow?

Do you wish more people fell in love with your OSS?

I’d hazard a guess that everyone reading this would admit to being a techie at some level. And being a techie, I’d also imagine that you have blatant tech-love for certain products – gadgets, apps, sites, whatever.

But, let me ask you, are there any OSS products on your love-interest list?

If yes, leave me a comment of “yes” and name of the product below.
If no, leave me a comment of “no” below.

I’m really interested and intrigued to see your answer.

There’s probably only one OSS that I’ve ever had a tech-crush on (but it’s no longer available on the market). It definitely wasn’t love at first sight. If I’m honest, it was probably the opposite. It was a love that took a long time to build. It had some cool modules, but generally it was a bit clunky. The real attraction was that the power and elegance of its data model allowed me to do almost anything with it. To build almost anything with it. To answer almost any business / network / operation question that I could dream up.

I wonder whether the same is true of your other tech-loves? Do they provide the platform for us to create/achieve things that we never dreamed we’d be able to?

If that’s true, I wonder then whether that’s one key to solving the header question?

I wonder whether the other key (the second authentication factor) is in the speed that a user can achieve the necessary level of expertise? Few users ever have the luxury that I had, spending every day for years, to establish the required expertise to make that OSS excel.

As Seth Godin says, “Make things better by making better things.”

PS. If you were kind enough to leave a Yes or No comment below, I’d also love to hear why in an additional comment.

A single glass of pain or single pane of glass??

Is your OSS a single pane of glass, or a single glass of pain?

You can tell I’m being a little flippant here. People often (perhaps idealistically) talk about OSS as being the single pane of glass (SPOG) to manage a network.

I say “idealistically” for a couple of reasons:

  1. There are usually many personas who interact with an OSS, each with vastly different user interface (UI) needs
  2. There is usually more than one OSS product in a client’s OSS suite, often from different vendors, with varying levels of integration

Where a single pane of glass can be a true ambition is as a consolidated health-status dashboard / portal, Invariably, this portal is used by executive / leader / manager personas who want to quickly see a single-screen health status that covers all networks and/or parts of the OSS suite. When things go wrong, this portal becomes the single glass of pain.

These single panes tend to be heavily customised for each organisation as every one has a unique set of metrics-that-matter. For those designing these panes, the key is to not just include vanity metrics, but to show information that the leader can action.

But the interesting perspective here is whether the single glass of pain is even relevant within your organisation’s culture. It’s just my opinion, but I prefer for coal-face workers to be empowered to make rapid recovery actions rather than requiring direction from up high in the org-chart. Coal-face workers generally have different tools with UIs that *should* help them monitor, manage and repair super-efficiently.

To get back to the “idealistic” comment above, each OSS UI needs to be fit-for-purpose for each unique persona (eg designers, product owners, network operations, etc). To me this implies that there is no single pane of glass…

I should caveat that by citing the example of an OSS search interface, something I’ve yet to see in OSS… although that’s just a front end to dozens of persona-specific panes of glass.

Unleashing the chaos monkeys on your OSS

I like to compare OSS projects with chaos theory. A single butterfly flapping it’s wings (eg a conversation with the client) can have unintended consequences that cause a tornado (eg the client’s users refusing to use a new OSS).

The day-to-day operation of a network and its management tools can be similarly sensitive to seemingly minor inputs. We can never predict or test for every combination of knock-on effects. This means that forecasting the future is impossible and failure is inevitable.

If we take these two statements to be true, it perhaps changes the way we engineer our OSS.

How many production OSS (and/or related EMS) do you know of whose operators have to tiptoe around the edges for fear of causing a meltdown? Conversely, how many do you know whose operators would quite happily trigger failures with confidence, knowing that their solution is robust and will recover without perceptibly impacting customers?

How many of you could confidently trigger scheduled or unscheduled outages of various types on your production OSS to introduce the machine learning seeding technique discussed yesterday?

Would you be prepared to unleash the chaos monkeys on your OSS / network like Netflix is prepared to do on its production systems?

Most OSS are designed for known errors and mechanisms are put in place to prevent them. Instead I wonder whether we should be design systems on the assumption that failure is inevitable, so recovery should be both rapid and automated.

It’s a subtle shift in thinking. Reduce the test scenarios that might lead to OSS failure, and increase the number of intentional OSS failures to test for recovery.

PS. Oh, and you’d rightly argue that a telco is very different from Netflix. There’s a lot more complexity in the networks, especially the legacy stacks. Many a telco would NEVER let anyone intentionally cause even the slightest degradation / failure in the network. This is where digital twin technology potentially comes into play.

An OSS without the shackles of topology

It’s been nearly two decades since I designed my first root-cause analysis (RCA) rule. It was completely reliant on network topology – more specifically, it relied on a network hierarchy to determine which alarms could be suppressed.

I had a really interesting discussion today with some colleagues who are using much more modern RCA techniques. I was somewhat surprised, but not surprised at all in hindsight, that their Machine Learning engine doesn’t even use topology data. It just looks at events and tries to identify patterns.

That’s a really interesting insight that hadn’t dawned on me before. But it’s an exciting one because it effectively unshackles our fault management tools from data quality perfection in our inventory / asset databases. It also possibly lessens the need for integrations that share topological data.

Equally interesting, the ML engine had identified over 4,000 patterns, but only a dozen had been codified and put into use so far. In other words, the machine was learning, but humans still needed to get involved in the process to confirm that the machine had learned correctly.

Makes me wonder whether the ML pre-seeding technique we discussed in an earlier post might actually be useful for confirmations at a greater scale than the team had achieved with 12 of 4000+ to date.

The standard approach is to let the ML loose and identify patterns. This is the reactive approach. The ML reacts to the alarms that are pushed up from the network. It looks at alarms and determines what the root cause is based on historical data. A human then has to check that the root cause is correct by reverse engineering the alarm stream (just like a network operator used to do before RCA tools came along) and comparing. If the comparison is successful, the person then approves this pattern.

My proposed alternate approach is the proactive method. If we proactively trigger a fault (e.g. pull a patch lead, take a port down, etc), we start from a position of already knowing what the root cause is. This has three benefits:
1. We can check if the ML’s interpretation of root cause is right
2. We’ve proactively seeded the ML’s data with this root cause example
3. We categorically know what the root cause is, unlike the reactive mode which only assumes the operator has correctly diagnosed the root cause

Then we just have to figure out a whole bunch of proactive failures to test safely. Where to start? Well, I’d speak with the NOC operators to find out what their most common root causes are and look to trigger those events.

More tomorrow on intentionally triggering failures in production systems.

Mythical OSS beasts – feature removal releases

Life can be improved by adding, or by subtracting. The world pushes us to add, because that benefits them. But the secret is to focus on subtracting…

No amount of adding will get me where I want to be. The adding mindset is deeply ingrained. It’s easy to think I need something else. It’s hard to look instead at what to remove.

The least successful people I know run in conflicting directions, drawn to distractions, say yes to almost everything, and are chained to emotional obstacles.

The most successful people I know have a narrow focus, protect against time-wasters, say no to almost everything, and have let go of old limiting beliefs.”
Derek Sivers, here.

I’m really curious here. Have you ever heard of an OSS product team removing a feature? Nope?? Me either!

I’ve seen products re-factored, resulting in changes to features. I’ve also seen products obsoleted and their replacements not offer all of the same features. But what about a version upgrade to an existing OSS product that has features subtracted? That never happens does it?? The adding mindset is deeply ingrained.

So let’s say we do want to go on a subtraction drive and remove some of the clutter from our OSS. I know plenty of OSS GUIs where subtraction is desperately needed BTW! But how do we know what to remove?

I have no data to back this up, but I would guess that almost every OSS would have certain functions that are not used, by any of their customers, in a whole year. That functionality was probably built for a specific use-case for a specific customer that no longer has relevance. Perhaps for a service type that is no longer desired by the market or a network type that will never be used again.

Question is, does your OSS have profiling instrumentation that allows you to measure what functionality is and isn’t used across your whole client base?

Can your products team readily produce a usage profile graph like the following that shows a list of functions (x-axis) by the number of times each function is used (y-axis) in a given time window? Per client? Across all clients?
Long-tail of OSS functionality use

Leave us a comment below if you’ve ever seen this type of profiling instrumentation (not for code optimisation, but for identifying client utilisation levels) and/or systematic feature subtraction initiatives.

BTW. I should make the distinction that just because a function hasn’t been used in a while, doesn’t mean it should automatically be removed. Some functionality (eg data loaders) might be rarely used, but important to retain.

I was a huge bottleneck on my first OSS project

I became a problematic bottleneck on my first OSS project. It didn’t start that way, but it definitely ended that way. And I’ve been thinking ever since about how I could’ve managed that better.

I started out as a network subject matter expert but wasn’t a bottleneck in that role. However, the next two functions I absorbed were the source of the problem.

The first additional role was in becoming the unofficial document librarian. Most of the documents coming into our organisation came through me. Being inquisitive, I’d review each document and try to apply it to what my colleagues and I were trying to achieve. When the team had an information void, they’d come to me with the problem and I’d not just point them to the relevant document/s but dive into helping to solve the problem.

The next role was assisting to model network data into the OSS database. This morphed into becoming responsible for all of the data in the database. In those days, I didn’t have a Minimum Viable Data (MVD) mindset. Instead it was an ingest-it-all-and-figure-out-how-to-use-it-later mentality. When the team had a data void, they’d come to me with the problem and I’d not just point them to the relevant data and what it meant but dive into helping to solve their problem/s.

You can see how this is leading to being a bottleneck can’t you?

I was effectively asking for all problems to be re-routed through me. Every person on the project (except possibly the project admins) relied on documentation and data. I averaged 85 hour weeks for about 2.5 years on that project, but still didn’t get close to servicing all the requests. Great as a learning exercise. Not great for the project.

Twenty years on, how would I do it better? Well, let me ask first, how would you do it better?

You possibly have many more ideas, but the two I’d like to leave you with are:

  • Figure out ways to make teaching more repeatable and self-learnt
  • Very closely aligned, and more importantly, is in asking leading questions that help others solve their own problems

It still feels like it’s less helpful to not dive into solving the problem, but it undoubtedly improves overall team efficiency and growth.

Oh, and by the way, if you’re just starting out in OSS and want to speed up your own development into becoming an OSS linchpin – find your way into the document librarian and/or data management roles. After all these years on OSS projects, I still think these are the best places to launch into the learning curve from.

The use of drones by OSS

The last few days have been all about organisational structuring to support OSS and digital transformations. Today we take a different tack – a more technical diversion – onto how drones might be relevant to the field of OSS.

A friend recently asked for help to look into the use of drones in his archaeological business. This got me to thinking about how they might apply in cross-over with OSS.

I know they’re already used to perform really accurate 3D cable route / corridor surveying. Much cooler than the old surveyor diagrams on A1 sheets from the old days. Apparently experts in the field can even tell if there’s rock in the surveyed area by looking at the vegetation patterns, heat and LIDAR scans.

But my main area of interest is in the physical inventory. With accurate geo-tagging available on drones and the ability to GPS correct the data, it seems like a really useful technique for getting outside plant (OSP) data into OSS inventory systems. Or geo-correcting data for brownfields assets.
Drone-based cable corridor surveys
Have you heard of drone-based OSP asset identification and mapping data being fed into inventory systems yet? I haven’t, but it seems like the logical next step. Do you know anyone who has started to dabble in this type of work? If you do, please send me a note as I’d love to be introduced.

Once loaded into the inventory system, with 3d geo-location, we then have the ability to visualise the OSP data with augmented reality solutions.

And other applications for drone technology?

Using my graphene analogy to help fix OSS data

By now I’m sure you’ve heard about graph databases. You may’ve even read my earlier article about the benefits graph databases offer when modelling network inventory when compared with relational databases. But have you heard the Graphene Database Analogy?

I equate OSS data migration and data quality improvement with graphene, which is made up of single layers of carbon atoms in hexagonal lattices (planes).

The graphene data model

There are four concepts of interest with the graphene model:

  1. Data Planes – Preparing and ingesting data from siloes (eg devices, cards, ports) is relatively easy. ie building planes of data (black carbon atoms and bonds above)
  2. Bonds between planes – It’s the interconnections between siloes (eg circuits, network links, patch-leads, joints in pits, etc) that is usually trickier. So I envisage alignment of nodes (on the data plane or graph, not necessarily network nodes) as equivalent to bonds between carbon atoms on separate planes (red/blue/aqua lines above).
    Alignment comes in many forms:

    1. Through spatial alignment (eg a joint and pit have the same geospatial position, so the joint is probably inside the pit)
    2. Through naming conventions (eg same circuit name associated with two equipment ports)
    3. Various other linking-key strategies
    4. Nodes on each data plane can potentially be snapped together (either by an operator or an algorithm) if you find consistent ways of aligning nodes that are adjacent across planes
  3. Confidence – I like to think about data quality in terms of confidence-levels. Some data is highly reliable, other data sets less so. For example if you have two equipment ports with a circuit name identifier, then your confidence level might be 4 out of 4* because you know the exact termination points of that circuit. Conversely, let’s say you just have a circuit with a name that follows a convention of “LocA-LocB-speed-index-type” but has no associated port data. In that case you only know that the circuit terminates at LocationA and LocationB, but not which building, rack, device, card, port so your confidence level might only be 2 out of 4.
  4. Visualisation – Having these connected panes of data allows you to visualise heat-map confidence levels (and potentially gaps in the graph) on your OSS data, thus identifying where data-fix (eg physical audits) is required

* the example of a circuit with two related ports above might not always achieve 4 out of 4 if other checks are applied (eg if there are actually 3 ports with that associated circuit name in the data but we know it should represent a two-ended patch-lead).

Note: The diagram above (from graphene-info.com) shows red/blue/aqua links between graphene layers as capturing hydrogen, but is useful for approximating the concept of aligning nodes between planes

Can OSS/BSS assist CX? We’re barely touching the surface

Have you ever experienced an epic customer experience (CX) fail when dealing a network service operator, like the one I described yesterday?

In that example, the OSS/BSS, and possibly the associated people / process, had a direct impact on poor customer experience. Admittedly, that 7 truck-roll experience was a number of years ago now.

We have fewer excuses these days. Smart phones and network connected devices allow us to get OSS/BSS data into the field in ways we previously couldn’t. There’s no need for printed job lists, design packs and the like. Our OSS/BSS can leverage these connected devices to give far better decision intelligence in real time.

If we look to the logistics industry, we can see how parcel tracking technologies help to automatically provide status / progress to parcel recipients. We can see how recipients can also modify their availability, which automatically adjusts logistics delivery sequencing / scheduling.

This has multiple benefits for the logistics company:

  • It increases first time delivery rates
  • Improves the ability to automatically notify customers (eg email, SMS, chatbots)
  • Decreases customer enquiries / complaints
  • Decreases the amount of time the truck drivers need to spend communicating back to base and with clients
  • But most importantly, it improves the customer experience

Logistics is an interesting challenge for our OSS/BSS due to the sheer volume of customer interaction events handled each day.

But it’s another area that excites me even more, where CX is improved through improved data quality:

  • It’s the ability for field workers to interact with OSS/BSS data in real-time
  • To see the design packs
  • To compare with field situations
  • To update the data where there is inconsistency.

Even more excitingly, to introduce augmented reality to assist with decision intelligence for field work crews:

  • To provide an overlay of what fibres need to be spliced together
  • To show exactly which port a patch-lead needs to connect to
  • To show where an underground cable route goes
  • To show where a cable runs through trayway in a data centre
  • etc, etc

We’re barely touching the surface of how our OSS/BSS can assist with CX.

Calculating the cost of quality

This week of posts has followed the theme of the cost of quality. Data quality that is.

But how do you calculate the cost of bad data quality?

Yesterday’s post mentioned starting with PNI (Physical Network Inventory). PNI is the cables, splices / joints, patch panels, ducts, pits, etc. This data doesn’t tend to have a programmable interface to electronically reconcile with. This makes it prone to errors of many types – mistakes in manual entry, reconfigurations that are never documented, assets that are lost or stolen, assets that are damaged or degraded, etc.

Some costs resulting from poor PNI data quality (DQ) can be considered primary costs. This includes SLA breaches caused by an inability to identify a fault within an SLA window due to incorrect / incomplete / indecipherable design data. These costs are the most obvious and easy to calculate because they result in SLA penalties. If a network operator misses a few of these with tier 1 clients then this is the disaster referred to yesterday.

But the true cost of quality is in the ripple-out effects. The secondary costs. These include the many factors that result in unnecessary truck rolls. With truck rolls come extra costs including contractor costs, delayed revenues, design rework costs, etc.

Other secondary effects include:

  • Downstream data maintenance in systems that rely on PNI data
  • Code in downstream systems that caters for poor data quality, which in turn increases the costs of complexity such as:
    • Additional testing
    • Additional fixing
    • Additional curation
  • Delays in the ability to get new products to market
  • Ability to accurately price products (due to variation in real costs caused by extra complexity)
  • Reduced impact of automations (due to increased variants)
  • Potential to impact Machine Learning / Artificial Intelligence engines, which rely on reliable and consistent data at scale
  • etc

There are probably more sophisticated ways to calculate the cost of quality across all these factors and more, but in most cases I just use a simple multiplier:

  • Number of instances of DQ events (eg number of additional truck rolls); times by
  • A rule-of-thumb cost impact of each event (eg the cost of each additional truck roll)

Sometimes the rules-of-thumb are challenging to estimate, so I tend to err on the side of conservatism. I figure that even if the rules-of-thumb aren’t perfectly accurate, at least they produce a real cost estimate rather than just anecdotal evidence.

And more importantly, the tertiary and less tangible costs of brand damage (also known as Customer Experience or CX or reputation damage). We’ll talk a little more about that tomorrow.

 

Waiting for the disaster to invest in the data

Have you seen OSS tools where the applications are brilliant but consigned to failure by bad data? I definitely have! I call it the data death spiral. It’s a well known fact in the industry that bad data can ruin an OSS. You know it. I know it. Everyone knows it.

But how many companies do you know that invest in data quality? I mean truly invest in it.

The status quo is not to invest in the data, but the disaster. That is the disaster caused by the data!

Being a data nerd, it boggles my brain to understand why that is. My only assumption to date is that we don’t adequately measure the cost of quality. Or more to the point, what the cost impact is resulting from bad data.

I recently attempted to model the cost of quality. My model focuses on the ripple-out impacts from poor PNI (Physical Network Inventory) quality data alone. Using conservative numbers, the cost of quality is in the millions for the first carrier I applied it to.

Why do you think operators wait for the disaster before investing in the data? What alternate techniques do you use to focus attention, and investment, on the data?

Where an absence of OSS data can still provide insights

The diagram below has some parallels with OSS. The story however is a little long before it gets to the OSS part, so please bear with me.

null

The diagram shows analysis the US Navy performed during WWII on where planes were being shot. The theory was that they should be reinforcing the areas that received the most hits (ie the wing-tips, central part of the body, etc as per the diagram).

Abraham Wald, a statistician, had a completely different perspective. His rationale was that the hits would be more uniform and the blank sections on the diagram above represent the areas that needed reinforcement. Why? Well, the planes that were hit there never made it home for analysis.

In OSS, this is akin to the device or EMS that has failed and is unable to send any telemetry data. No alarms are appearing in our alarm lists for those devices / EMS because they’re no longer capable of sending any.

That’s why we use heartbeat mechanisms to confirm a system is alive (and capable of sending alarms). These come in the form of pollers, pingers or just watching for other signs of life such as network traffic of any form.

In what other areas of OSS can an absence of data demonstrate an insight?

The layers of ITIL redundancy

Today’s is something of a heretical post, especially for the believers in ITIL. In the world of OSS, we look to build in layers of resiliency and not layers of redundancy.

The following diagram and subsequent text in italics describes a typical ITIL process and is all taken from https://www.computereconomics.com/article.cfm?id=1074

Example of relationship between ITIL incidents, problems, and changes

The sequence of events as shown in Figure 1 is as follows:

  • At TIME = 0, an External Event is detected by the Incident Management process. This could be as simple as a customer calling to say that service is unavailable or it could be an automated alert from a system monitoring device.The incident owner logs and classifies this as incident i2. Then, the incident owner tries to match i2 to known errors, work-arounds, or temporary fixes, but cannot find a match in the database.
  • At TIME = 1, the incident owner dispatches a problem request to the Problem Management process anticipating a work-around, temporary fix, or other assistance. In doing so, the incident owner has prompted the creation of Problem p2.
  • At TIME = 2, the problem owner of p2 returns the expected temporary fix to the incident owner of i2.  Note that both i2 and p2 are active and exist simultaneously. The incident owner for i2 applies the temporary fix.
  • In this case, the work-around requires a change request.  So, at Time = 3, the incident owner for i2 initiates change request, c2.
  • The change request c2 is applied successfully, and at TIME = 4, c2 is closed. Note that for a while i2, p2 and c2 all exist simultaneously.
  • Because c2 was successful, the incident owner for i2 can now confirm that the incident is resolved. At TIME = 5, i2 is closed. However, p2 remains active while the problem owner searches for a permanent fix. The problem owner for p2 would be responsible for implementing the permanent fix and initiating any necessary change requests.

But I look at it slightly differently. At their root, why do Incident Management, Problem Management and Change Management exist? They’re all just mechanisms for resolving a system* health issue. If we detect an event/s and fix it, we don’t have to expend all the effort of flicking tickets around.

Thinking within the T2R paradigm of trouble -> incident -> problem -> change -> resolve holds us back. If we can skip the middle steps and immediately associate a resolution with the event/s, we get a whole lot more efficient. If we can immediately relate a trigger with a reaction, we can also get rid of the intermediate ticket flickers and the processing cycle time.

So, the NOC of the future surely requires us to build a trigger -> reaction recommendation engine (and body of knowledge). That’s a more powerful tool to supply to our NOC operators than incidents, problems and change requests. (Easier to write about than to actually solve though of course)

The TMN model suffers from modern network anxiety

As the TMN diagram below describes, each layer up in the network management stack abstracts but connects (as described in more detail in “What an OSS shouldn’t do“). That is, each higher layer reduces the amount if information/control within a domain that it’s responsible for, but it assumes a broader responsibility for connecting multiple domains together.
OSS abstract and connect

There’s just one problem with the diagram. It’s a little dated when we take modern virtualised infrastructure into account.

In the old days, despite what the layers may imply, it was common for an OSS to actually touch every layer of the pyramid to resolve faults. That is, OSS regularly connected to NMS, EMS and even devices (NE) to gather network health data. The services defined at the top of the stack (BSS) could be traced to the exact devices (NE / NEL) via the circuits that traversed them, regardless of the layers of abstraction. It helped for root-cause analysis (RCA) and service impact analysis (SIA).

But with modern networks, the infrastructure is virtualised, load-balanced and since they’re packet-switched, they’re completely circuitless (I’m excluding virtual circuits here by the way). The bottom three layers of the diagram could effectively be replaced with a cloud icon, a cloud that the OSS has little chance of peering into (see yellow cloud in the diagram later in this post).

The concept of virtualisation adds many sub-layers of complexity too by the way, as higlighted in the diagram below.

ONAP triptych

So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the bottom, other than to say the services consume from a known pool of resources. Fault resolution becomes more abstracted as a result.

But what’s interesting is that there’s another layer that’s not shown on the typical TMN model above. That is the physical network inventory (PNI) layer. The cables, splices, joints, patch panels, equipment cards, etc that underpin every network. Yes, even virtual networks.

In the old networks the OSS touched every layer, including the missing layer. That functionality was provided by PNI management. Fault resolution also occurred at this layer through tickets of work conducted by the field workforce (Workforce Management – WFM).

In new networks, OSS/BSS tie services to resource pools (the top two layers). They also still manage PNI / WFM (the bottom, physical layer). But then there’s potentially an invisible cloud in the middle. Three distinctly different pieces, probably each managed by a different business unit or operational group.
BSS OSS cloud abstract

Just wondering – has your OSS/BSS developed control anxiety issues from losing some of the control that it once had?