Lightning strikes in OSS

Operators have developed many unique understandings of what impacts the health of their networks.

For example, mobile operators know that they have faster maintenance cycles in coastal areas than they do in warm, dry areas (yes, due to rust). Other operators have a high percentage of faults that are power-related. Others are impacted by failures caused by lightning strikes.

Near-real-time weather pattern and lightning strike data is now readily accessible, potentially for use by our OSS.

I was just speaking with one such operator last week who said, “We looked at it [using lightning strike data] but we ended up jumping at shadows most of the time. We actually started… looking for DSLAM alarms which will show us clumps of power failures and strikes, then we investigate those clumps and determine a cause. Sometimes we send out a single truck to collect artifacts, photos of lightning damage to cables, etc.”

That discussion got me wondering about what other lateral approaches are used by operators to assure their networks. For example:

  1. What external data sources do you use (eg meteorology, lightning strike, power feed data from power suppliers or sensors, sensor networks, etc)?
  2. Do you use it in proactive or reactive mode (eg to diagnose a fault or to use engineering techniques to prevent faults)?
  3. Have you built algorithms (eg root-cause, predictive maintenance, etc) to utilise your external data sources?
  4. If so, do those algorithms help establish automated closed-loop detect and response cycles?
  5. By measuring and managing, has it created quantifiable improvements in your network health?
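
On question 3, here’s a minimal sketch of the sort of lateral correlation an algorithm might perform – matching power-related DSLAM alarms against a lightning-strike feed by time and distance. The record shapes, field names and thresholds are illustrative assumptions, not any operator’s actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

@dataclass
class Event:
    timestamp: datetime
    lat: float
    lon: float

def km_between(a: Event, b: Event) -> float:
    """Great-circle (haversine) distance between two events, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def correlate(dslam_alarms: list[Event], strikes: list[Event],
              max_km: float = 5.0, window: timedelta = timedelta(minutes=10)):
    """Pair each power-related alarm with any lightning strike nearby in both time and space."""
    matches = []
    for alarm in dslam_alarms:
        for strike in strikes:
            if abs(alarm.timestamp - strike.timestamp) <= window and km_between(alarm, strike) <= max_km:
                matches.append((alarm, strike))
    return matches
```

Clumps of matched alarms are then the candidates worth investigating – or, as per the quote above, worth sending a single truck out to collect artifacts for.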

I’d love to hear about your clever and unique insight-generation ideas. Or even the ideas you’ve proposed that haven’t been built yet.

282 million reasons for increased OSS/BSS scrutiny

“The hotel group Marriott International has been told by the UK Information Commissioner’s Office that it will be fined a little over £99 million (A$178 million) over a data breach that occurred in December last year…
This is the second fine for data breaches announced by the ICO on successive days. On Monday, it said British Airways would be fined £183.39 million (A$329.1 million) for a data breach that occurred in September 2018.”
Sam Varghese of ITwire.

The scale of the fines issued to Marriott and BA is mind-boggling.

Here’s a link to the GDPR (General Data Protection Regulation) fine regime and determination process. Fines can be issued by GDPR policing agencies of up to €20 million, or 4% of the worldwide annual revenue of the prior financial year, whichever is higher.
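
As a trivial worked example of that cap (the revenue figure below is made up):

```python
def gdpr_fine_cap(worldwide_annual_revenue_eur: float) -> float:
    """Upper tier of the GDPR fine regime: the greater of EUR 20 million
    or 4% of prior-year worldwide annual revenue."""
    return max(20_000_000.0, 0.04 * worldwide_annual_revenue_eur)

# A firm with EUR 12 billion in prior-year revenue faces a cap of EUR 480 million, not EUR 20 million
print(gdpr_fine_cap(12_000_000_000))  # 480000000.0
```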

Determination is based on the following questions:

  1. Nature of infringement: number of people affected, damage they suffered, duration of infringement, and purpose of processing
  2. Intention: whether the infringement is intentional or negligent
  3. Mitigation: actions taken to mitigate damage to data subjects
  4. Preventative measures: how much technical and organizational preparation the firm had previously implemented to prevent non-compliance
  5. History: (83.2e) past relevant infringements, which may be interpreted to include infringements under the Data Protection Directive and not just the GDPR, and (83.2i) past administrative corrective actions under the GDPR, from warnings to bans on processing and fines
  6. Cooperation: how cooperative the firm has been with the supervisory authority to remedy the infringement
  7. Data type: what types of data the infringement impacts; see special categories of personal data
  8. Notification: whether the infringement was proactively reported to the supervisory authority by the firm itself or a third party
  9. Certification: whether the firm had qualified under approved certifications or adhered to approved codes of conduct
  10. Other: other aggravating or mitigating factors may include financial impact on the firm from the infringement

The two examples listed above provide 282 million reasons for governments to police data protection more stringently than they do today. The regulatory pressure is only going to increase, right? As I understand it, these processes are only enforced in reactive mode currently. What if the regulators move to proactive mode?

Question for you – Looking at #7 above, do you think the customer information stored in your OSS/BSS is more or less “impactful” than that of Marriott or British Airways?

Think about this question in terms of the number of daily interactions you have with hotels and airlines versus telcos / ISPs. I’ve stayed in Marriott hotels for over a year in accumulated days. I’ve boarded hundreds of flights. But I can’t begin to imagine how many of my data points the telcos / ISP could potentially collect every day. It’s in our OSS/BSS data stores where those data points are most likely to end up.

Do you think our OSS/BSS are going to come under increasing GDPR-like scrutiny in coming years? Put it this way, I suspect we’re going to become more familiar with risk management around the 10 dot points above than we have been in the past.

Step-by-step guide to build a systematic root-cause analysis (RCA) pipeline

Fault / Alarm management tools have lots of strings to their functionality bows to help operators focus in on the target/s that matter most. ITU-T’s recommendation X.733 provided an early framework and common model for classification of alarms. This allowed OSS vendors to build a standardised set of filters (eg severity, probable cause, etc). ITU-T’s recommendation M.3703 then provided a set of guiding use cases for managing alarms. These recommendations have been around since the 1990s (or possibly even before).

Despite these “noise reduction” tools being readily available, they’re still not “compressing” event lists enough in all cases.

I imagine, like me, you’ve heard many customer stories where so many new events are appearing in an event list each day that the NOC (network operations centre) just can’t keep up. Dozens of new events are appearing on the screen, then scrolling off the bottom of it before an operator has even had a chance to stop and think about a resolution.

So if humans can’t keep up with the volume, we need to empower machines with their faster processing capabilities to do the job. But to do so, we first have to take a step away from the noise and help build a systematic root-cause analysis (RCA) pipeline.

I call it a pipeline because there are generally a lot of RCA rules that are required. There are a few general RCA rules that can be applied “out of the box” on a generic network, but most need to be specifically crafted to each network.

So here’s a step-by-step guide to build your RCA pipeline:

  1. Scope – Identify your initial target / scope. For example, what are you seeking to prioritise:
    1. Event volume reduction to give the NOC breathing space to function better
    2. Identifying “most important” events (but defining what is most important)
    3. Minimising SLA breaches
    4. etc
  2. Gather Data – Gather incident and ticket data. Your OSS is probably already doing this, but you may need to pull data together from various sources (eg alarms/events, performance, tickets, external sources like weather data, etc)
  3. Pattern Identification – Pattern identification and categorisation of incidents. This generally requires a pattern identification tool, ideally supplied by your alarm management and/or analytics supplier
  4. Prioritise – Using a long-tail graph like below, prioritise pattern groups by the following (and in line with item #1 above):
      1. Number of instances of the pattern / group (ie frequency)
      2. Priority of instances (ie urgency of resolution)
      3. Number of linked incidents (ie volume)
      4. Other technique, such as a cumulative/blended metric

  5. Gather Resolution Knowledge – Understand current NOC approaches to fault-identification and triage, as well as what’s important to them (noting that they may have biases such as managing to vanity metrics)
  6. Note any Existing Resolutions – Identify and categorise any existing resolutions and/or RCA rules (if data supports this)
  7. Short-list Remaining Patterns – Overlay resolution patterns on the long-tail (to show which patterns are already solved for), then identify the remaining priority patterns that don’t have a resolution yet.
  8. Codify Patterns – Progressively set out to identify possible root-cause by analysing cause-effect such as:
    1. Topology-based
    2. Object hierarchy
    3. Time-based ripple
    4. Geo-based ripple
    5. Other (as helped to be defined by NOC operators)
  9. Knowledge base – Create a knowledge base that itemises root-causes and supporting information
  10. Build Algorithm / Automation – Create an algorithm for identifying root-cause and related alarms. Identify level of complexity, risks, unknowns, likelihood, control/monitoring plan for post-install, etc. Then build pilot algorithm (and possibly roll-back technique??). This might not just be an RCA rule, but could also include other automations. Automations could include creating a common problem and linking all events (not just root cause event but all related events), escalations, triggering automated workflows, etc
  11. Test pilot algorithm (with analytics??)
  12. Introduce algorithm into production use – But continue to monitor what’s being suppressed to confirm the rule is behaving as intended
  13. Repeat – Then repeat from steps 7 to 12 to codify the next most important pattern
  14. Leading metrics – Identify leading metrics and/or preventative measures that could precede the RCA rule. Establish closed-loop automated resolution
  15. Improve – Manage and maintain process improvement
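
To make steps 3 and 4 a little more tangible, here’s a minimal sketch of building the long-tail and prioritising pattern groups. It assumes each historical incident has already been tagged with a pattern signature by your pattern identification tool; the field names and the blended score are illustrative only.

```python
from collections import defaultdict

# Hypothetical incident records, already tagged with a pattern signature (step 3).
incidents = [
    {"pattern": "DSLAM-power-fail", "priority": 1, "linked_incidents": 14},
    {"pattern": "fibre-cut-ring-7", "priority": 2, "linked_incidents": 3},
    {"pattern": "DSLAM-power-fail", "priority": 1, "linked_incidents": 9},
    # ... many more
]

def build_long_tail(incidents):
    """Group incidents by pattern and rank them with a blended score (step 4)."""
    groups = defaultdict(lambda: {"frequency": 0, "urgency": 0, "volume": 0})
    for inc in incidents:
        g = groups[inc["pattern"]]
        g["frequency"] += 1                                     # how often the pattern occurs
        g["urgency"] = max(g["urgency"], 5 - inc["priority"])   # higher = more urgent
        g["volume"] += inc["linked_incidents"]                  # how many incidents it drags in
    ranked = sorted(groups.items(),
                    key=lambda kv: kv[1]["frequency"] * kv[1]["urgency"] + kv[1]["volume"],
                    reverse=True)
    return ranked  # the head of this list is the start of the long-tail graph

for pattern, stats in build_long_tail(incidents):
    print(pattern, stats)
```

The head of that ranked list, minus anything already covered by existing rules (steps 6 and 7), is where codifying the next RCA rule pays off first.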

What if most OSS/BSS are overkill? Planning a simpler version

You may recall a recent article that provided a discussion around the demarcation between OSS and BSS, which included the following graph:

Note that this mapping is just my interpretation of the demarcation, not the definitive guide. It’s definitely open to differing opinions (ie religious wars).

Many of you will be familiar with the framework that the mapping is overlaid onto – TM Forum’s TAM (The Application Map). Version R17.5.1 in this case. It is as close as we get to a standard mapping of OSS/BSS functionality modules. I find it to be a really useful guide, so today’s article is going to call on the TAM again.

As you would’ve noticed in the diagram above, there are many, many modules that make up the complete OSS/BSS estate. And you should note that the diagram above only includes Level 2 mapping. The TAM recommendation gets a lot more granular than this. This level of granularity can be really important for large, complex telcos.

For the OSS/BSS that support smaller telcos, network providers or utilities, this might be overkill. Similarly, there are OSS/BSS vendors that want to cover all or large parts of the entire estate for these types of customers. But as you’d expect, they don’t want to provide the same depth of functionality coverage that the big telcos might need.

As such, I thought I’d provide the cut-down TAM mapping below for those who want a less complex OSS/BSS suite.

It’s a really subjective mapping because each telco, provider or vendor will have their own perspective on mandatory features or modules. Hopefully it provides a useful starting point for planning a low complexity OSS/BSS.

Then what high-level functionality goes into these building blocks? That’s possibly even more subjective, but here are some hints:

OSS that repair virtualised networks – the dual loop approach

In a recent article, we talked about Network Service Assurance (NSA) in an environment where network virtualisation exists.

One of the benefits of virtualisation or NaaS (Network as a Service) is that it provides a layer of programmability to your network. That is, to be able to instantiate network services by software through a network API. Virtualisation also tends to assume/imply that there is a huge amount of available capacity (the resource pool) that it can shift workloads between. If one virtual service instance dies or deteriorates, then just automatically spin up another. If one route goes down, customer services are automatically re-directed via alternate routes and the service is maintained. No problem…

But there are some problems that can’t be solved in software. You can’t just use software to fix a cable that’s been cut by an excavator. You can’t just use software to fix failed electronics. Modern virtualised networks can do a great job of self-healing, routing around the problem areas. But there are still physical failures that need to be repaired / replaced / maintained by a field workforce. NSA doesn’t tend to cover that.

Looking at the diagram below, NSA does a great job of the closed-loop assurance within the red circle. But it then needs to kick out to the green closed-loop assurance processes that are already driven by our OSS/BSS.

As described in the link above, “Perhaps if the NSA was just assuring the yellow cloud/s, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).”

Therefore, a key part of the NSA process is how it kicks up from closed-loop 1 to closed-loop 2. Then, after closed-loop 2 has repaired the physical problem, NSA needs to be aware that the repaired resource is now back in the pool of available resources. Does your NSA automatically notice this, or must it receive a notification from closed loop 2?

It could be as simple as NSA sending alarms into the alarm list with a clearly articulated root-cause. The alarm has a ticket/s raised against it. The ticket triggers the field workforce to rectify it and also triggers customer assurance teams/tools to send notifications to impacted customers (if indeed they send notifications to customers who may not actually be affected yet due to the resilience measures that have kicked in). Standard OSS/BSS practice!
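
To illustrate the kick-up from closed-loop 1 to closed-loop 2 in code, here’s a minimal sketch of that hand-off. The class and method names are hypothetical placeholders, not any particular vendor’s API.

```python
from dataclasses import dataclass

@dataclass
class PhysicalFault:
    resource_id: str            # eg a PNF, cable segment or port
    root_cause: str             # as diagnosed by the NSA's closed loop
    service_impact: list[str]   # customer services consuming the degraded resource

class CustomerServiceAssurance:
    """Hypothetical stand-in for the existing OSS/BSS assurance stack (closed loop 2)."""
    def raise_alarm_and_ticket(self, fault: PhysicalFault) -> str:
        ticket_id = f"TT-{fault.resource_id}"
        # ... create alarm with articulated root-cause, raise ticket, trigger truck roll,
        #     notify (potentially) impacted customers ...
        return ticket_id

    def notify_repair_complete(self, ticket_id: str) -> None:
        # Callback so closed loop 1 knows the repaired resource is back in the available pool
        print(f"{ticket_id} resolved; resource returned to available pool")

def nsa_detects_physical_degradation(csa: CustomerServiceAssurance, fault: PhysicalFault) -> None:
    """Closed loop 1 can't fix hardware in software, so it escalates to closed loop 2."""
    ticket = csa.raise_alarm_and_ticket(fault)
    # NSA keeps serving traffic from the remaining pool while the field workforce repairs
    print(f"Escalated '{fault.root_cause}' on {fault.resource_id} as {ticket}")

csa = CustomerServiceAssurance()
nsa_detects_physical_degradation(
    csa, PhysicalFault("OLT-0042/card-3", "laser bias current failure", ["CUST-SVC-001", "CUST-SVC-087"]))
```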

OSS change…. but not too much… oh no…..

Let me start today with a question:
Does your future OSS/BSS need to be drastically different to what it is today?

Please leave me a comment below, answering yes or no.

I’m going to take a guess that most OSS/BSS experts will answer yes to this question, that our future OSS/BSS will change significantly. It’s the reason I wrote the OSS Call for Innovation manifesto some time back. As great as our OSS/BSS are, there’s still so much need for improvement.

But big improvement needs big change. And big change is scary, as Tom Nolle points out:
“IT vendors, like most vendors, recognize that too much revolution doesn’t sell. You have to creep up on change, get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening.”

Do you feel like we’re already in the midst of a revolution? Cloud computing, web-scaling and virtualisation (of IT and networks) have been partly responsible for it. Agile and continuous integration/delivery models too.

The following diagram shows a “from the moon” level view of how I approach (almost) any new project.

The key to Tom’s quote above is in step 2. Just how far, or how ambitious, into the future are you projecting your required change? Do you even know what that future will look like? After all, the environment we’re operating within is changing so fast. That’s why Tom is suggesting that for many of us, step 2 is just a “creep up on it change.” The gap is essentially small.

The “creep up on it change” means just adding a few new relatively meaningless features at the end of the long tail of functionality. That’s because we’ve already had the most meaningful functionality in our OSS/BSS for decades (eg customer management, product / catalog management, service management, service activation, network / service health management, inventory / resource management, partner management, workforce management, etc). We’ve had the functionality, but that doesn’t mean we’ve perfected the cost or process efficiency of using it.

So let’s say we look at step 2 with a slightly different mindset. Let’s say we don’t try to add any new functionality. We lock that down to what we already have. Instead we do re-factoring and try to pull the efficiency levers, which means changes to:

  1. Platforms (eg cloud computing, web-scaling and virtualisation as well as associated management applications)
  2. Methodologies (eg Agile, DevOps, CI/CD, noting of course that they’re more than just methodologies, but also come with tools, etc)
  3. Process (eg User Experience / User Interfaces [UX/UI], supply chain, business process re-invention, machine-led automations, etc)

It’s harder for most people to visualise what the Step 2 Future State looks like. And if it’s harder to envisage Step 2, how do we then move onto Steps 3 and 4 with confidence?

This is the challenge for OSS/BSS vendors, suppliers, integrators and implementers. How do we, “get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening?” And I should point out that it’s not just buyers we need to get disconnected from the comfortable past, but ourselves, myself definitely included.

Network Service Assurance has new meaning

Back in the old days, Network Service Assurance probably had a different meaning than it might today.

Clearly it’s assurance of a network service. That’s fairly obvious. But it’s in the definition of “network service” where the old and new terminologies have the potential to diverge.

In years past, telco networks were “nailed up” and network functions were physical appliances. I would’ve assumed (probably incorrectly, but bear with me) that a “network service” was “owned” by the carrier and was something like a bearer circuit (as distinct from a customer service or customer circuit). Those bearer circuits, using protocols such as DWDM, SDH, SONET, ATM, etc, potentially carried lots of customer circuits so they were definitely worth assuring. And in those nailed-up networks, we knew exactly which network appliances / resources / bearers were being utilised. This simplified service impact analysis (SIA) and allowed targeted fault-fix.

In those networks the OSS/BSS was generally able to establish a clear line of association from customer service to physical resources as per the TMN pyramid below. Yes, some abstraction happened as information permeated up the stack, but awareness of connectivity and resource utilisation was generally retained end-to-end (E2E).
OSS abstract and connect

But in the more modern computer or virtualised network, it all goes a bit haywire, perhaps starting right back at the definition of a network service.

The modern “network service” is more aligned to ETSI’s NFV definition – “a composition of network functions and defined by its functional and behavioral specification. The Network Service contributes to the behaviour of the higher layer service, which is characterised by at least performance, dependability, and security specifications. The end-to-end network service behaviour is the result of a combination of the individual network function behaviours as well as the behaviours of the network infrastructure composition mechanism.”

They are applications running at OSI’s application layer that can be consumed by other applications. These network services include DNS, DHCP, VoIP, etc, but the concept of NaaS (Network as a Service) expands the possibilities further.

So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the physical layer, other than to say the customer services consume from a pool of resources (the yellow cloud below). Assurance becomes more disconnected as a result.

BSS OSS cloud abstract

OSS/BSS are able to tie customer services to pools of resources (the yellow cloud). And OSS/BSS tools also include PNI / WFM (Physical Network Inventory / Workforce Management) to manage the bottom, physical layer. But now there’s potentially an opaque gulf in the middle where virtualisation / NaaS exists.

The end-to-end association between customer services and the physical resources that carry them is lost. Unless we can find a way to establish E2E association, we just have to hope that our modern Network Service Assurance (NSA) tools make the yellow cloud robust to the point of infallibility. BTW. If the yellow cloud includes NaaS, then the NSA has to assure the NaaS gateway, catalog and all services instantiated through the gateway.
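
One way to picture re-establishing that association is as a simple layered graph, where the middle (virtual) layer is populated from the NaaS / NFV orchestrator rather than from traditional inventory. This is purely a conceptual sketch with made-up identifiers.

```python
# Layered association graph: customer service -> virtual network service -> physical resource.
# The middle layer would be populated from the NaaS / NFV orchestrator, the bottom from PNI.
associations = {
    ("CUST-SVC-001", "consumes"): ["NET-SVC-vFW-07"],
    ("NET-SVC-vFW-07", "hosted_on"): ["VNF-vFW-07a", "VNF-vFW-07b"],
    ("VNF-vFW-07a", "runs_on"): ["COMPUTE-HOST-12"],
    ("VNF-vFW-07b", "runs_on"): ["COMPUTE-HOST-15"],
    ("COMPUTE-HOST-12", "connected_via"): ["FIBRE-SEG-0453"],
    ("COMPUTE-HOST-15", "connected_via"): ["FIBRE-SEG-0981"],
}

def physical_footprint(entity: str, seen=None) -> set[str]:
    """Walk the layers downwards to recover the physical footprint of a customer service."""
    seen = seen or set()
    children = [c for (parent, _), kids in associations.items() if parent == entity for c in kids]
    leaves = set()
    for child in children:
        if child in seen:
            continue
        seen.add(child)
        grandkids = physical_footprint(child, seen)
        leaves |= grandkids or {child}
    return leaves

print(physical_footprint("CUST-SVC-001"))
# {'FIBRE-SEG-0453', 'FIBRE-SEG-0981'} (order may vary)
```

Walking a structure like this is what would let an assurance tool tie a degraded fibre segment back to the customer services that ultimately consume it.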

But as we know, there will always be failures in physical infrastructure (cable cuts, electronic malfunctions, etc). The individual resources can’t afford to be infallible, even if the resource pool seeks to provide collective resiliency.

Modern NSA has to find a way to manage the resource pool but also coordinate fault-fix in the physical resources that underpin it, like the OSS used to do (still does??). They have to do more than just build policies and actions to ensure SLAs, don’t they? They can seek to manage security, power, performance, utilisation and more. Unfortunately, not everything can be fixed programmatically, although that is a great place for NSA to start.

Perhaps if the NSA was just assuring the yellow cloud, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).

I’d love to hear your thoughts. Hopefully you can even correct me if/where I’m wrong.

Auto-releasing chaos monkeys to harden your network (CT/IR)

In earlier posts, we’ve talked about using Netflix’s chaos monkey approach as a way of getting to Zero Touch Assurance (ZTA). The chaos monkeys intentionally trigger faults in the network as a means of ensuring resilience. Not just for known degradation / outage events, but for unknown events too.

I’d like to introduce the concept of CT/IR – Continual Test / Incremental Resilience. Analogous to CI/CD (Continuous Integration / Continuous Delivery) before it, CT/IR is a method to systematically and programmatically test the resilience of the network, and then ensure that resilience continually improves.

The continual, incremental improvement in resiliency potentially comes via multiple feedback loops:

  1. Ideally, the existing resilience mechanisms work around or overcome any degradation or failure in the network
  2. The continual triggering of faults into the network will provide additional seed data for AI/ML tools to learn from and improve upon, especially root-cause analysis (noting that in the case of CT/IR, the root-cause is certain – we KNOW the cause – because we triggered it – rather than reverse engineering what the cause may have been)
  3. We can program the network to overcome the problem (eg turn up extra capacity, re-engineer traffic flows, change configurations, etc). Having the NaaS that we spoke about yesterday provides greater programmability for the network, by the way.
  4. We can implement systematic programs / projects to fix endemic faults or weak spots in the network *
  5. Perform regression tests to constantly stress-test the network as it evolves through network augmentation, new device types, etc

Now, you may argue that no carrier in their right mind will allow intentional faults to be triggered. So that’s where we unleash the chaos monkeys on our digital twin technology and/or PSUP (Production Support) environments at first. Then on our prod network if we develop enough trust in it.
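
Here’s a minimal sketch of what one CT/IR cycle might look like when pointed at a digital twin first. The DigitalTwin and AssuranceML classes are hypothetical stand-ins, not a real chaos-engineering framework.

```python
import random

class DigitalTwin:
    """Minimal stand-in for a digital twin / PSUP environment (hypothetical API)."""
    def inject(self, fault): print(f"injecting {fault}")
    def collect_alarms(self, window_s): return [f"alarm-{random.randint(1, 9)}"]
    def service_health_ok(self): return random.random() > 0.2   # did resilience cope?
    def log_weak_spot(self, fault): print(f"weak spot found: {fault}")
    def restore(self, fault): print(f"restoring {fault}")

class AssuranceML:
    """Minimal stand-in for the RCA / AI engine being seeded with labelled samples."""
    def add_labelled_sample(self, alarms, root_cause):
        print(f"learned: {alarms} were caused by {root_cause}")

# Hypothetical catalogue of faults we know how to trigger safely.
FAULT_CATALOGUE = [
    {"action": "shutdown_port", "target": "edge-router-3:ge-0/0/1"},
    {"action": "kill_vnf", "target": "vFW-instance-12"},
    {"action": "drop_power", "target": "DSLAM-0042"},
]

def ct_ir_cycle(twin: DigitalTwin, ml: AssuranceML) -> bool:
    """One CT/IR cycle: inject a known fault, check resilience, seed the ML with labelled data."""
    fault = random.choice(FAULT_CATALOGUE)
    twin.inject(fault)                      # we KNOW the root cause, because we caused it
    alarms = twin.collect_alarms(window_s=300)
    recovered = twin.service_health_ok()    # feedback loop 1: did resilience mechanisms cope?
    ml.add_labelled_sample(alarms, fault)   # feedback loop 2: perfectly-labelled training data
    if not recovered:
        twin.log_weak_spot(fault)           # feedback loops 3-4: remediate or raise a project
    twin.restore(fault)
    return recovered

if __name__ == "__main__":
    ct_ir_cycle(DigitalTwin(), AssuranceML())
```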

I live in Australia, which suffers from severe bushfires every summer. Our fire-fighters spend a lot of time back-burning during the cooler months to reduce flammable material and therefore the severity of summer fires. Occasionally the back-burns get out of control, causing problems. But they’re still done for the greater good. The same principle could apply to unleashing chaos monkeys on a production network… once you’re confident in your ability to control the problems that might follow.

* When I say network, I’m referring not just to the physical and logical network, but also to support functions such as EMS (Element Management Systems), NCM (Network Configuration Management tools), backup/restore mechanisms, service order replay processes in the event of an outage, OSS/BSS, NaaS, etc.

Where does BSS end and OSS begin?

Over the years, I’ve been asked the question many times, “what’s the difference between OSS (Operational Support Systems) and BSS (Business Support Systems)?” I’ve also been asked, albeit slightly less regularly, how OSS and BSS map to TM Forum standards like the TAM and eTOM.

To my knowledge, TM Forum has never attempted to map OSS vs BSS. It sets off too many religious wars.

Just for fun, I thought I’d have a crack at trying to map OSS and BSS onto the TAM. Click on the image for a larger PDF version.

OSS and BSS overlaid onto the TAM

I’ve taken the perspective that customer or business-facing functionality is generally considered to be BSS. Alternatively, network / operations-facing functionality is generally considered to be OSS.
And these two tend to overlap at the service layer.

Or, you could just simply call them business operations systems (BOS) that cover the entire TAM estate.

What do you think? Does it trigger a religious war for you? Comments welcomed below.

FWIW. I come from an era when my “OSS” tools had a lot of functionality that could arguably be classified as BSS-centric (eg product management, customer relationship management, service order entry, etc). They also happened to deliver functionality that others might classify as NMS or EMS (Network Management System or Element Management System) in nature. In my mind, they’ve always just been software that supports operationalisation of a network, whether customer or network/resource-facing. It’s one of the reasons this site is called Passionate About OSS, not Passionate About OSS/BSS/NMS/EMS.

Is your OSS squeaking like an un-oiled bearing?

Network operators spend huge amounts on building and maintaining their OSS/BSS every year. There are many reasons they invest so heavily, but in most cases it can be distilled back to one thing – improving operational efficiency.

And our OSS/BSS definitely do improve operational efficiency, but there are still so many sources of friction. They’re squeaking like un-oiled bearings. Here are just a few of the common sources:

  1. First-time Installation
  2. Identifying best-fit tools
  3. Procurement of new tools
  4. Update / release processes
  5. Continuous data quality / consistency improvement
  6. Navigating to all features through the user interface
  7. Non-intuitive functionality / processes
  8. So many variants / complexity that end-users take years to attain expert-level capability
  9. Integration / interconnect
  10. Getting new starters up to speed
  11. Getting proficient operators to expertise
  12. Unlocking actionable insights from huge data piles
  13. Resolving the root-cause of complex faults
  14. Onboarding new customers
  15. Productionising new functionality
  16. Exception and fallout handling
  17. Access to supplier expertise to resolve challenges

The list goes far deeper than that too. The challenge for many OSS product teams, for any number of reasons, is that their focus is on adding new features rather than reducing friction in what already exists.

The challenge for product teams is diagnosing where the friction and risks are for their customers / stakeholders. How do you get that feedback?

  • Every vendor has a product support team, so that’s a useful place to start, both in terms of what’s generating the most support calls and in terms of first-hand feedback from customers
  • Do you hold user forums on a regular basis, where you get many of your customers together to discuss their challenges, your future roadmap, new improvements / features?
  • Does your process “flow” data show where the sticking points are for operators?
  • Do you conduct gemba walks with your customers?
  • Do you have a program of ensuring all developers spend at least a few days a year interacting directly with customers on their site/s?
  • Do you observe areas of difficulty when delivering training?
  • Do you go out of your way to ask your customers / stakeholders questions that are framed around their pain-points, not just framed within the context of your existing OSS?
  • Do you conduct customer surveys? More importantly, do you conduct surveys through an independent third-party?

On the last dot-point, I’ve been surprised at some of the profound insights end-users have shared with me when I’ve been conducting these reviews as the independent interviewer. I’ve tended to find answers are more open / honest when being delivered to an independent third-party than if the supplier asks directly. If you’d like assistance running a third-party review, leave us a note on the contact page. We’d be delighted to assist.

Fast and slow OSS, where uCPE and network virtualisation fits in

Yesterday’s post talked about one of the many dichotomies in OSS – fast and slow data / processes.

One of the longer lead-time items in relation to OSS data and processes is in network build and customer connections. From the time when capacity planning or a customer order creates the signal to build, it can be many weeks or months before the physical infrastructure work is complete and appearing in the OSS.

There are two financial downsides to this. Firstly, it tends to be CAPEX-heavy with equipment, construction, truck-rolls, government approvals, etc burning through money. Meanwhile, it’s also a period where there is no money coming in because the services aren’t turned on yet. The time-to-cash cycle of new build (or augmentation) is the bane of all telcos.

This is one of the exciting aspects of network virtualisation for telcos. In a time where connectivity is nearly ubiquitous in most countries, often with high-speed broadband access, physical build becomes less essential (except over-builds). Technologies such as uCPE (Universal Customer Premises Equipment), NFV (Network Function Virtualisation), SD WAN (Software-Defined Wide Area Networks), SDN (Software Defined Networks) and others mean that we can remotely upgrade and reconfigure the network without field work.

Network virtualisation gives the potential to speed up many of the slowest and costliest processes that run through our OSS… but only if our OSS can support efficient orchestration of virtualised networks. And that means having an OSS with the flexibility to easily change out slow processes and replace them with fast ones without massive overhauls.

Give me a fast OSS and I might ask you to slooooow doooown

The traditional telco (and OSS) ran at different speeds. Some tasks had to happen immediately (eg customers calling one another) while others took time (eg getting a connection to a customer’s home, which included designs, approvals, builds, etc), often weeks.

Our OSS have processes that must happen sequentially and expediently. They also have processes that must wait for dependencies, conditional events and time delays. Some roles need “fast,” others can cope with “slow.” Who wins out in this dilemma?

Even the data we rely on can transact at different speeds. For capacity planning, we’re generally interested in longer-term data. We don’t have to process it in real-time. Therefore we can choose to batch process at longer cycle times and with summarised data sets. For network assurance, we’re generally interested in getting data as quickly as is viable.

Today’s post is about that word, viable, and the pragmatism we sometimes have to apply to our OSS.

For example, if our operations teams want to reduce network performance poll cycles from every 15 mins down to once a minute, we increase the amount of data to process by 15x. That means our data storage costs go up by 15x (assuming a flat-rate cost structure applies). The other hidden cost is that our compute and network costs also go up because we have to transfer and process 15x as much data.
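
As a crude back-of-envelope (the metric count and record size below are entirely made-up placeholders):

```python
# Back-of-envelope for moving from 15-minute to 1-minute performance polls.
polls_per_day_15min = 24 * 60 // 15      # 96 samples per metric per day
polls_per_day_1min = 24 * 60 // 1        # 1440 samples per metric per day
scaling = polls_per_day_1min / polls_per_day_15min   # 15x

metrics = 2_000_000                      # monitored metrics across the network (assumed)
bytes_per_sample = 200                   # assumed average record size
gb_per_day_15min = metrics * polls_per_day_15min * bytes_per_sample / 1e9
gb_per_day_1min = gb_per_day_15min * scaling

print(f"{scaling:.0f}x more data: {gb_per_day_15min:.0f} GB/day -> {gb_per_day_1min:.0f} GB/day")
# 15x more data: 38 GB/day -> 576 GB/day
```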

The trade-off we have to make in response to this rapid escalation of cost (when going from 15 min to 1 min) is in the benefits we might derive. Can we avoid SLA (Service Level Agreement) breach costs? Can we avoid costly outages? Can we avoid damage to equipment? Can we reduce the risk of losing our carrier license?

The other question is whether our operators actually have the ability to respond to 15x as much data. Do we have enough people to respond at an increased cycle time? Do we have OSS tools that are capable of filtering what’s important and disregarding “background” activity? Do we have OSS tools that are capable of learning from every single metric (eg AI), at volumes the human brain could never cope with?

Does it make sense that we have a single platform for handling fast and slow processes? For example, do we use the same platform to process 1 minute-cycle performance data for long-term planning (batch-processed once daily) and quick-fire assurance (processed as fast as possible)?

If we stick to one platform, can our OSS apply data reduction techniques (eg selective discard of records) to get the benefits of speed, but with the cost reduction of slow?

Do you wish more people fell in love with your OSS?

I’d hazard a guess that everyone reading this would admit to being a techie at some level. And being a techie, I’d also imagine that you have blatant tech-love for certain products – gadgets, apps, sites, whatever.

But, let me ask you, are there any OSS products on your love-interest list?

If yes, leave me a comment of “yes” and name of the product below.
If no, leave me a comment of “no” below.

I’m really interested and intrigued to see your answer.

There’s probably only one OSS that I’ve ever had a tech-crush on (but it’s no longer available on the market). It definitely wasn’t love at first sight. If I’m honest, it was probably the opposite. It was a love that took a long time to build. It had some cool modules, but generally it was a bit clunky. The real attraction was that the power and elegance of its data model allowed me to do almost anything with it. To build almost anything with it. To answer almost any business / network / operation question that I could dream up.

I wonder whether the same is true of your other tech-loves? Do they provide the platform for us to create/achieve things that we never dreamed we’d be able to?

If that’s true, I wonder then whether that’s one key to solving the header question?

I wonder whether the other key (the second authentication factor) is in the speed that a user can achieve the necessary level of expertise? Few users ever have the luxury that I had, spending every day for years, to establish the required expertise to make that OSS excel.

As Seth Godin says, “Make things better by making better things.”

PS. If you were kind enough to leave a Yes or No comment below, I’d also love to hear why in an additional comment.

A single glass of pain or single pane of glass??

Is your OSS a single pane of glass, or a single glass of pain?

You can tell I’m being a little flippant here. People often (perhaps idealistically) talk about OSS as being the single pane of glass (SPOG) to manage a network.

I say “idealistically” for a couple of reasons:

  1. There are usually many personas who interact with an OSS, each with vastly different user interface (UI) needs
  2. There is usually more than one OSS product in a client’s OSS suite, often from different vendors, with varying levels of integration

Where a single pane of glass can be a true ambition is as a consolidated health-status dashboard / portal. Invariably, this portal is used by executive / leader / manager personas who want to quickly see a single-screen health status that covers all networks and/or parts of the OSS suite. When things go wrong, this portal becomes the single glass of pain.

These single panes tend to be heavily customised for each organisation as every one has a unique set of metrics-that-matter. For those designing these panes, the key is to not just include vanity metrics, but to show information that the leader can action.

But the interesting perspective here is whether the single glass of pain is even relevant within your organisation’s culture. It’s just my opinion, but I prefer for coal-face workers to be empowered to make rapid recovery actions rather than requiring direction from up high in the org-chart. Coal-face workers generally have different tools with UIs that *should* help them monitor, manage and repair super-efficiently.

To get back to the “idealistic” comment above, each OSS UI needs to be fit-for-purpose for each unique persona (eg designers, product owners, network operations, etc). To me this implies that there is no single pane of glass…

I should caveat that by citing the example of an OSS search interface, something I’ve yet to see in OSS… although that’s just a front end to dozens of persona-specific panes of glass.

Unleashing the chaos monkeys on your OSS

I like to compare OSS projects with chaos theory. A single butterfly flapping its wings (eg a conversation with the client) can have unintended consequences that cause a tornado (eg the client’s users refusing to use a new OSS).

The day-to-day operation of a network and its management tools can be similarly sensitive to seemingly minor inputs. We can never predict or test for every combination of knock-on effects. This means that forecasting the future is impossible and failure is inevitable.

If we take these two statements to be true, it perhaps changes the way we engineer our OSS.

How many production OSS (and/or related EMS) do you know of whose operators have to tiptoe around the edges for fear of causing a meltdown? Conversely, how many do you know whose operators would quite happily trigger failures with confidence, knowing that their solution is robust and will recover without perceptibly impacting customers?

How many of you could confidently trigger scheduled or unscheduled outages of various types on your production OSS to introduce the machine learning seeding technique discussed yesterday?

Would you be prepared to unleash the chaos monkeys on your OSS / network like Netflix is prepared to do on its production systems?

Most OSS are designed for known errors and mechanisms are put in place to prevent them. Instead, I wonder whether we should design systems on the assumption that failure is inevitable, so recovery should be both rapid and automated.

It’s a subtle shift in thinking. Reduce the test scenarios that might lead to OSS failure, and increase the number of intentional OSS failures to test for recovery.

PS. Oh, and you’d rightly argue that a telco is very different from Netflix. There’s a lot more complexity in the networks, especially the legacy stacks. Many a telco would NEVER let anyone intentionally cause even the slightest degradation / failure in the network. This is where digital twin technology potentially comes into play.

An OSS without the shackles of topology

It’s been nearly two decades since I designed my first root-cause analysis (RCA) rule. It was completely reliant on network topology – more specifically, it relied on a network hierarchy to determine which alarms could be suppressed.

I had a really interesting discussion today with some colleagues who are using much more modern RCA techniques. I was somewhat surprised, but not surprised at all in hindsight, that their Machine Learning engine doesn’t even use topology data. It just looks at events and tries to identify patterns.

That’s a really interesting insight that hadn’t dawned on me before. But it’s an exciting one because it effectively unshackles our fault management tools from data quality perfection in our inventory / asset databases. It also possibly lessens the need for integrations that share topological data.

Equally interesting, the ML engine had identified over 4,000 patterns, but only a dozen had been codified and put into use so far. In other words, the machine was learning, but humans still needed to get involved in the process to confirm that the machine had learned correctly.
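
To picture what “just looking at events for patterns” might mean, here’s a minimal illustration that simply counts which alarm types co-occur within short time windows, with no inventory or topology data in sight. The alarm stream and window size are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up alarm stream: (timestamp, alarm_type). No topology anywhere.
alarms = [
    (datetime(2019, 7, 10, 2, 0, 5), "CARD_FAIL"),
    (datetime(2019, 7, 10, 2, 0, 9), "LOS"),
    (datetime(2019, 7, 10, 2, 0, 12), "LOS"),
    (datetime(2019, 7, 10, 3, 15, 0), "CARD_FAIL"),
    (datetime(2019, 7, 10, 3, 15, 4), "LOS"),
]

def co_occurrence(alarms, window=timedelta(seconds=30)) -> Counter:
    """Count pairs of alarm types that repeatedly appear within the same time window."""
    alarms = sorted(alarms)
    pairs = Counter()
    for i, (t1, a1) in enumerate(alarms):
        for t2, a2 in alarms[i + 1:]:
            if t2 - t1 > window:
                break
            if a1 != a2:
                pairs[tuple(sorted((a1, a2)))] += 1
    return pairs

print(co_occurrence(alarms).most_common())
# [(('CARD_FAIL', 'LOS'), 3)]
```

Groups that recur frequently become candidate patterns – which is exactly what a human then has to confirm or reject, hence only a dozen of the 4,000+ being in use so far.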

Makes me wonder whether the ML pre-seeding technique we discussed in an earlier post might actually be useful for confirmations at a greater scale than the team had achieved with 12 of 4000+ to date.

The standard approach is to let the ML loose and identify patterns. This is the reactive approach. The ML reacts to the alarms that are pushed up from the network. It looks at alarms and determines what the root cause is based on historical data. A human then has to check that the root cause is correct by reverse engineering the alarm stream (just like a network operator used to do before RCA tools came along) and comparing. If the comparison is successful, the person then approves this pattern.

My proposed alternate approach is the proactive method. If we proactively trigger a fault (e.g. pull a patch lead, take a port down, etc), we start from a position of already knowing what the root cause is. This has three benefits:
1. We can check if the ML’s interpretation of root cause is right
2. We’ve proactively seeded the ML’s data with this root cause example
3. We categorically know what the root cause is, unlike the reactive mode which only assumes the operator has correctly diagnosed the root cause

Then we just have to figure out a whole bunch of proactive failures to test safely. Where to start? Well, I’d speak with the NOC operators to find out what their most common root causes are and look to trigger those events.

More tomorrow on intentionally triggering failures in production systems.

Mythical OSS beasts – feature removal releases

“Life can be improved by adding, or by subtracting. The world pushes us to add, because that benefits them. But the secret is to focus on subtracting…

No amount of adding will get me where I want to be. The adding mindset is deeply ingrained. It’s easy to think I need something else. It’s hard to look instead at what to remove.

The least successful people I know run in conflicting directions, drawn to distractions, say yes to almost everything, and are chained to emotional obstacles.

The most successful people I know have a narrow focus, protect against time-wasters, say no to almost everything, and have let go of old limiting beliefs.”
Derek Sivers, here.

I’m really curious here. Have you ever heard of an OSS product team removing a feature? Nope?? Me either!

I’ve seen products re-factored, resulting in changes to features. I’ve also seen products obsoleted and their replacements not offer all of the same features. But what about a version upgrade to an existing OSS product that has features subtracted? That never happens does it?? The adding mindset is deeply ingrained.

So let’s say we do want to go on a subtraction drive and remove some of the clutter from our OSS. I know plenty of OSS GUIs where subtraction is desperately needed BTW! But how do we know what to remove?

I have no data to back this up, but I would guess that almost every OSS would have certain functions that are not used, by any of their customers, in a whole year. That functionality was probably built for a specific use-case for a specific customer that no longer has relevance. Perhaps for a service type that is no longer desired by the market or a network type that will never be used again.

Question is, does your OSS have profiling instrumentation that allows you to measure what functionality is and isn’t used across your whole client base?

Can your products team readily produce a usage profile graph like the following that shows a list of functions (x-axis) by the number of times each function is used (y-axis) in a given time window? Per client? Across all clients?
Long-tail of OSS functionality use
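
Here’s a minimal sketch of the sort of profiling instrumentation I mean: count usage events per function (and per client) and read the subtraction candidates off the bottom of the ranking. The event shape is an illustrative assumption.

```python
from collections import Counter, defaultdict

# Illustrative usage events emitted by the OSS GUI / API, eg one per function invocation,
# streamed from production instances over a 12-month window.
events = [
    {"client": "telco_a", "function": "create_service_order"},
    {"client": "telco_a", "function": "create_service_order"},
    {"client": "telco_b", "function": "bulk_data_loader"},
    {"client": "telco_a", "function": "legacy_atm_designer"},
    # ... many more
]

usage_by_function = Counter(e["function"] for e in events)
usage_by_client = defaultdict(Counter)
for e in events:
    usage_by_client[e["client"]][e["function"]] += 1

# The long tail: least-used functions across all clients are subtraction candidates,
# remembering that rarely-used is not the same as unimportant (eg data loaders).
for function, count in usage_by_function.most_common()[::-1]:
    print(function, count)

print(dict(usage_by_client["telco_a"]))   # the same view, per client
```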

Leave us a comment below if you’ve ever seen this type of profiling instrumentation (not for code optimisation, but for identifying client utilisation levels) and/or systematic feature subtraction initiatives.

BTW. I should make the distinction that just because a function hasn’t been used in a while, doesn’t mean it should automatically be removed. Some functionality (eg data loaders) might be rarely used, but important to retain.

I was a huge bottleneck on my first OSS project

I became a problematic bottleneck on my first OSS project. It didn’t start that way, but it definitely ended that way. And I’ve been thinking ever since about how I could’ve managed that better.

I started out as a network subject matter expert but wasn’t a bottleneck in that role. However, the next two functions I absorbed were the source of the problem.

The first additional role was in becoming the unofficial document librarian. Most of the documents coming into our organisation came through me. Being inquisitive, I’d review each document and try to apply it to what my colleagues and I were trying to achieve. When the team had an information void, they’d come to me with the problem and I’d not just point them to the relevant document/s but dive into helping to solve the problem.

The next role was assisting to model network data into the OSS database. This morphed into becoming responsible for all of the data in the database. In those days, I didn’t have a Minimum Viable Data (MVD) mindset. Instead it was an ingest-it-all-and-figure-out-how-to-use-it-later mentality. When the team had a data void, they’d come to me with the problem and I’d not just point them to the relevant data and what it meant but dive into helping to solve their problem/s.

You can see how this is leading to being a bottleneck, can’t you?

I was effectively asking for all problems to be re-routed through me. Every person on the project (except possibly the project admins) relied on documentation and data. I averaged 85 hour weeks for about 2.5 years on that project, but still didn’t get close to servicing all the requests. Great as a learning exercise. Not great for the project.

Twenty years on, how would I do it better? Well, let me ask first, how would you do it better?

You possibly have many more ideas, but the two I’d like to leave you with are:

  • Figure out ways to make teaching more repeatable and self-learnt
  • Very closely aligned, and more importantly, asking leading questions that help others solve their own problems

It still feels like it’s less helpful to not dive into solving the problem, but it undoubtedly improves overall team efficiency and growth.

Oh, and by the way, if you’re just starting out in OSS and want to speed up your own development into becoming an OSS linchpin – find your way into the document librarian and/or data management roles. After all these years on OSS projects, I still think these are the best places to launch into the learning curve from.

The use of drones by OSS

The last few days have been all about organisational structuring to support OSS and digital transformations. Today we take a different tack – a more technical diversion – onto how drones might be relevant to the field of OSS.

A friend recently asked for help to look into the use of drones in his archaeological business. This got me to thinking about how they might apply in cross-over with OSS.

I know they’re already used to perform really accurate 3D cable route / corridor surveying. Much cooler than the old surveyor diagrams on A1 sheets from the old days. Apparently experts in the field can even tell if there’s rock in the surveyed area by looking at the vegetation patterns, heat and LIDAR scans.

But my main area of interest is in the physical inventory. With accurate geo-tagging available on drones and the ability to GPS correct the data, it seems like a really useful technique for getting outside plant (OSP) data into OSS inventory systems. Or geo-correcting data for brownfields assets.
Drone-based cable corridor surveys
Have you heard of drone-based OSP asset identification and mapping data being fed into inventory systems yet? I haven’t, but it seems like the logical next step. Do you know anyone who has started to dabble in this type of work? If you do, please send me a note as I’d love to be introduced.

Once loaded into the inventory system, with 3d geo-location, we then have the ability to visualise the OSP data with augmented reality solutions.

And other applications for drone technology?

Using my graphene analogy to help fix OSS data

By now I’m sure you’ve heard about graph databases. You may’ve even read my earlier article about the benefits graph databases offer over relational databases when modelling network inventory. But have you heard the Graphene Database Analogy?

I equate OSS data migration and data quality improvement with graphene, which is made up of single layers of carbon atoms in hexagonal lattices (planes).

The graphene data model

There are four concepts of interest with the graphene model:

  1. Data Planes – Preparing and ingesting data from siloes (eg devices, cards, ports) is relatively easy. ie building planes of data (black carbon atoms and bonds above)
  2. Bonds between planes – It’s the interconnections between siloes (eg circuits, network links, patch-leads, joints in pits, etc) that are usually trickier. So I envisage alignment of nodes (on the data plane or graph, not necessarily network nodes) as equivalent to bonds between carbon atoms on separate planes (red/blue/aqua lines above).
    Alignment comes in many forms:

    1. Through spatial alignment (eg a joint and pit have the same geospatial position, so the joint is probably inside the pit)
    2. Through naming conventions (eg same circuit name associated with two equipment ports)
    3. Various other linking-key strategies
    4. Nodes on each data plane can potentially be snapped together (either by an operator or an algorithm) if you find consistent ways of aligning nodes that are adjacent across planes
  3. Confidence – I like to think about data quality in terms of confidence-levels. Some data is highly reliable, other data sets less so. For example if you have two equipment ports with a circuit name identifier, then your confidence level might be 4 out of 4* because you know the exact termination points of that circuit. Conversely, let’s say you just have a circuit with a name that follows a convention of “LocA-LocB-speed-index-type” but has no associated port data. In that case you only know that the circuit terminates at LocationA and LocationB, but not which building, rack, device, card, port so your confidence level might only be 2 out of 4.
  4. Visualisation – Having these connected panes of data allows you to visualise heat-map confidence levels (and potentially gaps in the graph) on your OSS data, thus identifying where data-fix (eg physical audits) is required

* the example of a circuit with two related ports above might not always achieve 4 out of 4 if other checks are applied (eg if there are actually 3 ports with that associated circuit name in the data but we know it should represent a two-ended patch-lead).

Note: The diagram above (from graphene-info.com) shows red/blue/aqua links between graphene layers as capturing hydrogen, but is useful for approximating the concept of aligning nodes between planes
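
To make the confidence-level idea a little more concrete, here’s a minimal sketch of scoring a circuit record out of 4 based on the alignment evidence present. The checks and weightings are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class CircuitRecord:
    name: str
    a_end_port: str | None = None       # eg "SYD-PE01 / slot 2 / port 3"
    b_end_port: str | None = None
    related_port_count: int = 0         # how many ports in the data reference this circuit name

def confidence(circuit: CircuitRecord) -> int:
    """Score 0-4: how confident are we that we know this circuit's true termination points?"""
    score = 0
    if circuit.name and circuit.name.count("-") >= 4:
        score += 2      # naming convention "LocA-LocB-speed-index-type" reveals both end locations
    if circuit.a_end_port and circuit.b_end_port:
        score += 2      # both termination ports known exactly
        if circuit.related_port_count not in (0, 2):
            score -= 1  # eg three ports carry this circuit name: the evidence is inconsistent
    return max(0, min(score, 4))

good = CircuitRecord("SYD-MEL-10G-001-ETH", "SYD-PE01/2/3", "MEL-PE02/1/1", related_port_count=2)
vague = CircuitRecord("SYD-MEL-10G-002-ETH")
print(confidence(good), confidence(vague))   # 4 2
```

Rolling scores like these up across the connected planes is what would drive the heat-map visualisation described in point 4.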