A modern twist on OSS architecture

I was speaking with a friend today about an old OSS assurance product that is undergoing a refresh and investment after years of stagnation.

He indicated that it was to come with about 20 out-of-the-box adaptors for data collection. I found that interesting because it was replacing a product that probably had in excess of 100 adaptors. It seemed like a major backward step… until my friend pointed out the types of adaptor in this new product iteration – Splunk, AWS, etc.

Of course!!

Our OSS no longer collect data directly from the network. We have web-scaled processes sucking everything out of the network / EMS, aggregating it and transforming / indexing / storing it. Then, like any other IT application, our OSS just collect what we need from a data set that has already been consolidated and homogenised.

I don’t know why I’d never thought about it like this before (ie building an architecture that doesn’t even consider connecting to the multitude of network / device / EMS types). In doing so, we lose the direct connection to the source, but we also reduce our integration tax load (directly to the OSS at least).

I’m really excited by a just-finished OSS analysis (part 3)

This is the third part of a series describing a really exciting analysis I’ve just finished.

Part 1 described how we can turn simple log files into a Sankey diagram that shows real-life process flows (not just a theoretical diagram drawn by BAs and SMEs), like below:

Part 2 described how the logs are broken down into a design tree and how we can assign weightings to each branch based on the data stored in the logs, as below:
OSS Decision Tree Analysis

I’ve already had lots of great feedback in relation to the Part 1 blog, especially from people who’ve had challenges capturing as-is processes. The feedback has been greatly appreciated, so I’m looking forward to helping them draw up their flow-charts on the way to helping optimise their process flows.

But that’s just the starting point. Today’s post is where things get really exciting (for me at least). Today we build on part 2: we don’t just record weightings, we use them to assist future decisions.

We can use the decision tree to “predict forward” and help operators / algorithms make optimal decisions whilst working towards process completion. We can use a feedback loop to steer an operator (or application) down the most optimal branches of the tree (and/or avoid the fall-out variants).

This allows us to create a closed-loop, self-optimising Decision Support System (DSS), as follows:

Note: Diagram sourced from https://passionateaboutoss.com/closing-the-loop-to-make-better-decisions, where further explanation is provided

Using log data alone, we can perform decision optimisation based on “likelihood of success” or “time to complete” as per the weightings table. If supplemented with additional data, the weightings table could also allow decisions to be optimised by “cost to complete” or many other factors.
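To make the mechanics a little more concrete, here’s a minimal Python sketch of how a weightings table could drive a “next best step” recommendation. The table structure and field names (current_dp, next_dp, success_rate, avg_minutes) are illustrative assumptions, not the actual schema used in the analysis.

```python
# Minimal sketch only - not the actual analysis code. Field names are illustrative.
weightings = [
    {"current_dp": "DP1", "next_dp": "DP2", "success_rate": 0.92, "avg_minutes": 1.0},
    {"current_dp": "DP1", "next_dp": "DP3", "success_rate": 0.97, "avg_minutes": 4.5},
    {"current_dp": "DP2", "next_dp": "DP3", "success_rate": 0.88, "avg_minutes": 2.0},
]

def recommend_next(current_dp, optimise_for="success_rate"):
    """Suggest the next decision point based on historical weightings."""
    candidates = [w for w in weightings if w["current_dp"] == current_dp]
    if not candidates:
        return None
    if optimise_for == "success_rate":                       # higher is better
        return max(candidates, key=lambda w: w["success_rate"])
    return min(candidates, key=lambda w: w["avg_minutes"])   # faster is better

print(recommend_next("DP1"))                    # optimise for likelihood of success
print(recommend_next("DP1", "avg_minutes"))     # optimise for time to complete
```

In a full DSS, the weightings would be refreshed continually from the incoming log stream (and other data sources such as cost) rather than hard-coded like this.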

The model has the potential to be used in “real-time” mode, using the constant stream of process logs to continually refine and adapt. For example:

  • If the long-term average of a process path is 1 minute, but there’s currently a problem and that path is failing, then another path (one that is otherwise slightly less optimal over the long-term) could be used until the first path is repaired
  • An operator happens to choose a new, more optimal path than has ever been identified previously (the delta function in the diagram). This sets a new benchmark and the new approach is fed back through the DSS (Darwinian selection)

If you’re wondering how the DSS could be implemented, I can envisage a few ways:

  1. Using existing RPA (Robotic Process Automation) tools [which are particularly relevant if the workflow box in the diagram above crosses multiple different applications (not just a single monolithic OSS/BSS)]
  2. Providing a feedback path into the functionality of the OSS/BSS and its GUI
  3. Via notifications (eg email, Slack, etc) to operators
  4. Via a simple, more manual process like flow diagrams, work instructions, scorecards or similar
  5. You can probably envisage other methods

I’m really excited by a just-finished OSS analysis (part 2)

As the title suggests, this is the second part in a series describing a process flow visualisation, optimisation and decision support methodology that uses simple log data as input.

Yesterday’s post, part 1 in the series, showed the visualisation aspect in the form of a Sankey flow diagram.

This visualisation is exciting because it shows how your processes are actually flowing (or not), as opposed to the theoretical process diagrams that are laboriously created by BAs in conjunction with SMEs. It also shows which branches in the flow are actually being utilised and where inefficiencies are appearing (and are therefore optimisation targets).

Some people have wondered how simple activity logs can be used to produce the Sankey diagrams. Hopefully the diagram below helps to describe this. You scan the log data looking for variants / patterns of flows and overlay those onto a map of decision states (DPs). In the diagram above, there are only 3 DPs, but 303 different variants (sounds implausible, but many variants do multiple loops through the 3 states and are therefore counted as different variants).

OSS Decision Tree Analysis

The numbers / weightings you see on the Sankey diagram are the number* of instances (of a single flow type) that have transitioned between two DPs / states.

* Note that this is not the same as the count value that appears in the Weightings table. We’ll get to that in tomorrow’s post when we describe how to use the weightings data for decision support.
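For those wondering what the variant extraction might look like in practice, here’s a hedged sketch: group the log records by flow instance, order them by timestamp, and treat each distinct sequence of states as a variant. The column names (flow_id, activity, start_time) are assumptions, not the actual log schema used in the analysis.

```python
# Illustrative sketch only - not the real analysis code or schema.
from collections import Counter
import pandas as pd

logs = pd.DataFrame([
    {"flow_id": "A1", "activity": "DP1", "start_time": "2019-07-01 09:00"},
    {"flow_id": "A1", "activity": "DP2", "start_time": "2019-07-01 09:05"},
    {"flow_id": "A1", "activity": "DP3", "start_time": "2019-07-01 09:20"},
    {"flow_id": "A2", "activity": "DP1", "start_time": "2019-07-01 10:00"},
    {"flow_id": "A2", "activity": "DP2", "start_time": "2019-07-01 10:02"},
    {"flow_id": "A2", "activity": "DP1", "start_time": "2019-07-01 10:30"},  # loops back
    {"flow_id": "A2", "activity": "DP2", "start_time": "2019-07-01 10:41"},
    {"flow_id": "A2", "activity": "DP3", "start_time": "2019-07-01 11:00"},
])

# One variant = the ordered sequence of states a single flow instance passed through
variants = (
    logs.sort_values("start_time")
        .groupby("flow_id")["activity"]
        .apply(lambda s: " > ".join(s))
)
print(Counter(variants))
# e.g. Counter({'DP1 > DP2 > DP3': 1, 'DP1 > DP2 > DP1 > DP2 > DP3': 1})
```

Run over 600,000 flow instances, the same grouping is what surfaces the 303 distinct variants mentioned above.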

I’m really excited by a just-finished OSS analysis

In your travels, I don’t suppose you’ve ever come across anyone having challenges capturing and/or optimising their as-is OSS/BSS process flows? Once or twice?? 🙂

Well I’ve just completed an analysis that I’m really excited about. It’s something I’ve been thinking about for some time, but only finished proving over the weekend. I thought it might have relevance to you too. It quickly helps to visualise as-is processes and identify areas to optimise.

The method takes activity logs (eg from OSS, ITIL, WFM, SAP or similar) and turns them into a process diagram (a Sankey diagram) like below with real instance volumes. Much better than a theoretical process map designed by BAs and SMEs don’t you think?? And much faster and more accurate too!!

OSS Sankey process diagram

A theoretical process map might just show a sequence of 3 steps, but the diagram above has used actual logs to show what’s really occurring. It highlights governance issues (skipped steps) and inefficiencies (ie the various loops) in the process too. Perfect for process improvement.

But more excitingly, it proves a path towards real-time “predict-forward” decision support without having to get into the complexities of AI. There’s more on this included in the analysis!

If this is of interest to you, let me know and I’ll be happy to walk you through the full analysis. Or if you want to know how your real as-is processes perform, I’d be happy to help turn your logs into visuals like the one above.

PS1. You might think you need a lot of fields to prepare the diagrams above. The good news is the only mandatory fields would be something like:

  1. Flow type – eg Order type, project type or similar (only required if the extract contains multiple flow types mixed together. The diagram above represents just one flow type)
  2. Flow instance identifier – eg Order number, project number or similar (the diagram above was based on data that had around 600,000 flow instances)
  3. Activity identifier – eg Activity name (as per the 3 states in the diagram above), recorded against each flow instance. Note that these will ideally come from an enumerated list (ie a finite pick-list)
  4. Timestamps – Start/end timestamp on each activity instance

If the log contains other details, such as the name of the operator who completed each activity, that can help add richness, but it’s not mandatory.
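To show how little is needed, here’s a rough sketch (not the actual analysis code) that takes a log extract with just those mandatory fields and produces the source / target / value triples that a Sankey library such as Plotly can render. The records, activity names and field names below are hypothetical.

```python
# Sketch only - the log records and field names mirror the mandatory fields in PS1.
from collections import Counter
import plotly.graph_objects as go

log = [  # one flow type only, already filtered
    {"flow_id": "ORD-1", "activity": "Submit",   "start": "09:00"},
    {"flow_id": "ORD-1", "activity": "Design",   "start": "09:10"},
    {"flow_id": "ORD-1", "activity": "Activate", "start": "11:00"},
    {"flow_id": "ORD-2", "activity": "Submit",   "start": "09:30"},
    {"flow_id": "ORD-2", "activity": "Design",   "start": "09:45"},
    {"flow_id": "ORD-2", "activity": "Design",   "start": "10:30"},  # re-work loop
    {"flow_id": "ORD-2", "activity": "Activate", "start": "12:00"},
]

# Count transitions between consecutive activities within each flow instance
transitions = Counter()
ordered = sorted(log, key=lambda r: (r["flow_id"], r["start"]))
for prev, curr in zip(ordered, ordered[1:]):
    if prev["flow_id"] == curr["flow_id"]:
        transitions[(prev["activity"], curr["activity"])] += 1

labels = sorted({a for pair in transitions for a in pair})
index = {label: i for i, label in enumerate(labels)}
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[index[s] for s, _ in transitions],
        target=[index[t] for _, t in transitions],
        value=list(transitions.values()),
    ),
))
fig.show()   # renders the real flow volumes, loops included, as a Sankey diagram
```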

PS2. The main objective of the analysis was to test concepts raised in the following blog posts:

Can you solve the omni-channel identity conundrum for OSS/BSS?

For most end-customers, the OSS/BSS we create are merely back-office systems that they never see. The closest they get are the customer portals that they interact with to drive workflows through our OSS/BSS. And yet, our OSS/BSS still have a big part to play in customer experience. In times where customers can readily substitute one carrier for another, customer service has become a key differentiator for many carriers. It therefore also becomes a priority for our OSS/BSS.

Customers now have multiple engagement options (aka omni-channel) and form factors (eg in-person, phone, tablet, mobile phone, kiosk, etc). The only options we used to have were a call to a contact centre / IVR (Interactive Voice Response), a visit to a store, or a visit from an account manager for business customers. Now there are websites, applications, text messages, multiple social media channels, chatbots, portals, blogs, etc. They all represent different challenges as far as offering a seamless customer experience across all channels.

I’ve just noticed TM Forum’s “Omni-channel Guidebook” (GB994), which does a great job at describing the challenges and opportunities. For example, it explains the importance of identity. End-users can only get a truly seamless experience if they can be uniquely identified across all channels. Unfortunately, some channels (eg IVR, website) don’t force end-users to self-identify.

The Ovum report, “Optimizing Customer Service in a Multi Channel World, March 2011” indicates that around 74% of customers use 3 channels or more for engaging customer service. In most cases, it’s our OSS/BSS that provide the data that supports a seamless experience across channels. But what if we have no unique key? What if the unique key we have (eg phone number) doesn’t uniquely identify the different people who use that contact point (eg different family members who use the same fixed-line phone)?

We could use personality profiling across these channels, but we’ve already seen how that has worked out for Cambridge Analytica and Facebook in terms of customer privacy and security.

I’d love to hear how you’ve done cross-channel identity management in your OSS/BSS. Have you solved the omni-channel identity conundrum?

PS. One thing I find really interesting. The whole omni-channel thing is about giving customers (or potential customers) the ability to connect via the channel they’re most comfortable with. But there’s one glaring exception. When an end-user decides a phone conversation is the only way to resolve their issue (often after already trying the self-service options), they call the contact centre number. But many big telcos insist on trying to deflect as many calls as possible to self-service options (they call it CVR – call volume reduction), because contact centre staff are much more expensive per transaction than the automated channels. That seems to be an anti-customer-experience technique if you ask me. What are your thoughts?

The 3 states of OSS consciousness

The last four posts have discussed how our OSS/BSS need to cope with different modes of working to perform effectively. We started off with the thread of “group flow,” where multiple different users of our tools can work cohesively. Then we talked about how flow requires a lack of interruptions, yet many of the roles using our OSS actually need constant availability (ie to be constantly interrupted).

From a user experience (UI/UX) perspective, we need an awareness of the state the operator/s needs to be in to perform each step of an end-to-end process, be it:

  • Deep think or flow mode – where the operator needs uninterrupted time to resolve a complex and/or complicated activity (eg a design activity)
  • Constant availability mode – where the operator needs to quickly respond to the needs of others and therefore needs a stream of notifications / interruptions (eg network fault resolutions)
  • Group flow mode – where a group of operators need to collaborate effectively and cohesively to resolve a complex and/or complicated activity (eg resolve a cross-domain fault situation)

This is a strong argument for every OSS/BSS supplier to have UI/UX experts on their team. Yet most leave their UI/UX with their coders. They tend to take the perspective that if the function can be performed, it’s time to move on to building the next function. That was the same argument used by all MP3 player suppliers before the iPod came along with its beautiful form and function and dominated the market.

Interestingly, modern architectural principles potentially make UI/UX design more challenging. With old, monolithic OSS/BSS, you at least had more control over end-to-end workflows (I’m not suggesting we should go back to the monoliths BTW). These days, you need to accommodate the unique nuances / inconsistencies of third-party modules like APIs / microservices.

As Evan Linwood incisively identified, “I guess we live in the age of cloud based API providers, theoretically enabling loads of pre-canned integration patterns but these may not be ideal for a large service provider… Definitely if the underlying availability isn’t there, but could also occur through things like schema mismanagement across multiple providers? (Which might actually be an argument for better management / B/OSS, rather than against the use of microservices!)”

Am I convincing any of you to hire more UI/UX resources? Or convincing you to register for UI/UX as your next training course instead of learning a ninth programming language?

Put simply, we need your assistance to take our OSS from this…
Old MP3 player

To this…
iPod

Completing an OSS design, going inside, going outside, going Navy SEAL

Our most recent post last week discussed the research investment that organisations like DARPA (Defense Advanced Research Projects Agency) and Google are making into group flow for the purpose of group effectiveness. It cited the $4.25m cost of training each elite Navy SEAL and their ability to operate as if choreographed in high-pressure / high-noise environments.

We contrasted this with the mechanisms used in most OSS that actually prevent flow-state from occurring. Today I’m going to dive into the work that goes into creating a new design (to activate a customer), and how our current OSS designs / processes inhibit flow.

Completely independently of our post, BBC released an article last week discussing how deep focus needs to become a central pillar of our future workplace culture.

To quote,

“Being switched on at all times and expected to pick things up immediately makes us miserable,” says [Cal] Newport. “It mismatches with the social circuits in our brain. It makes us feel bad that someone is waiting for us to reply to them. It makes us anxious.”

Because it is so easy to dash off a quick reply on email, Slack or other messaging apps, we feel guilty for not doing so, and there is an expectation that we will do it. This, says Newport, has greatly increased the number of things on people’s plates. “The average knowledge worker is responsible for more things than they were before email. This makes us frenetic. We should be thinking about how to remove the things on their plate, not giving people more to do…

Going cold turkey on email or Slack will only work if there is an alternative in place. Newport suggests, as many others now do, that physical communication is more effective. But the important thing is to encourage a culture where clear communication is the norm.

Newport is advocating for a more linear approach to workflows. People need to completely stop one task in order to fully transition their thought processes to the next one. However, this is hard when we are constantly seeing emails or being reminded about previous tasks. Some of our thoughts are still on the previous work – an effect called attention residue.”

That resonates completely with me. So let’s consider that and look into the collaboration process of a stylised order activation:

  1. Customer places an order via an order-entry portal
  2. Perform SQ (Service Qualification) and Credit Checks, automated processes
  3. Order is broken into work order activities (automated process)
  4. Designer1 picks up design work order activity from activity list and commences outside plant design (cables, pits, pipes). Her design pack includes:
    1. Updating AutoCAD / GIS drawings to show outside plant (new cable in existing pit/pipe, plus lead-in cable)
    2. Updating OSS to show splicing / patching changes
    3. Creates project BoQ (bill of quantities) in a spreadsheet
  5. Designer2 picks up next work order activity from activity list and commences active network design. His design pack includes:
    1. Allocation of CPE (Customer Premises Equipment) from warehouse
    2. Allocation of IP address from ranges available in IPAM (IP address manager)
    3. Configuration plan for CPE and network edge devices
  6. FieldWorkTeamLeader reviews inside plant and outside plant designs and allocates to FieldWorker1. FieldWorker1 is also issued with a printed design pack and the required materials
  7. FieldWorker1 commences build activities and finds out there’s a problem with the design. It indicates splicing the customer lead-in to fibres 1/2, but they appear to already be in use

So, what does FieldWorker1 do next?

The activity list / queue process has worked reasonably well up until this step in the process. It allowed each person to work autonomously, stay in deep focus and work in the sequence of their own choosing. But now, FieldWorker1 needs her issue resolved within a few minutes or she must move on to her next job (and next site). That would mean an additional truck-roll, and it would also annoy the customer, who now has to re-schedule and take an additional day off work to open their house for the installer.

FieldWorker1 now needs to collaborate quickly with Designer1, Designer2 and FieldWorkTeamLeader. But most OSS simply don’t provide the tools to do so. The go-forward decision in our example draws upon information from multiple sources (ie AutoCAD drawing, GIS, spreadsheet, design document, IPAM and the OSS). Not only that, but the print-outs given to the field worker don’t reflect real-time changes in data. Nor do they give any up-stream context that might help her resolve this issue.

So FieldWorker1 contacts the designers directly (and separately) via phone.

Designer1 and Designer2 have to leave deep-think mode to respond urgently to the notification from FieldWorker1 and then take minutes to pull up the data. Designer1 and Designer2 have to contact each other about conflicting data sets. Too much time passes. FieldWorker1 moves to her next job.

Our challenge as OSS designers is to create a collaborative workspace that has real-time access to all data (not just the local context as the issue probably lies in data that’s upstream of what’s shown in the design pack). Our workspace must also provide all participants with the tools to engage visually and aurally – to choreograph head-office and on-site resources into “group flow” to resolve the issue.

Even if such tools existed today, the question I still have is how we ensure our designers aren’t interrupted from their all-important deep-think mode. How do we prevent them from having to drop everything multiple times a day/hour? Perhaps the answer is in an organisational structure – where all designers have to cycle through the Design Support function (eg 1 day in a fortnight), to take support calls from field workers and help them resolve issues. It will give designers a greater appreciation for problems occurring in the field and also help them avoid responding to emails, slack messages, etc when in design mode.


Lightning strikes in OSS

Operators have developed many unique understandings of what impacts the health of their networks.

For example, mobile operators know that they have faster maintenance cycles in coastal areas than they do in warm, dry areas (yes, due to rust). Other operators have a high percentage of faults that are power-related. Others are impacted by failures caused by lightning strikes.

Near-real-time weather pattern and lightning strike data is now readily accessible, potentially for use by our OSS.

I was just speaking with one such operator last week who said, “We looked at it [using lightning strike data] but we ended up jumping at shadows most of the time. We actually started… looking for DSLAM alarms which will show us clumps of power failures and strikes, then we investigate those clumps and determine a cause. Sometimes we send out a single truck to collect artifacts, photos of lightning damage to cables, etc.”
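As a rough illustration of that clump-then-investigate approach (not the operator’s actual implementation), the sketch below groups power-related DSLAM alarms into time windows and flags any clump that coincides with a lightning strike record from an external feed. All data structures, names and thresholds here are assumptions.

```python
# Illustrative sketch only - data structures and thresholds are invented.
from datetime import datetime, timedelta

dslam_alarms = [  # (timestamp, dslam_id) - eg power-fail alarms from the OSS
    (datetime(2019, 7, 8, 14, 2), "DSLAM-021"),
    (datetime(2019, 7, 8, 14, 3), "DSLAM-044"),
    (datetime(2019, 7, 8, 14, 4), "DSLAM-013"),
]
lightning_strikes = [datetime(2019, 7, 8, 14, 1)]  # from an external feed

WINDOW = timedelta(minutes=10)
CLUMP_THRESHOLD = 3   # a "clump" = 3+ DSLAMs alarming within the window

def find_clumps(alarms, window, threshold):
    alarms = sorted(alarms)
    for i, (t0, _) in enumerate(alarms):
        clump = [a for a in alarms[i:] if a[0] - t0 <= window]
        if len({d for _, d in clump}) >= threshold:
            yield t0, clump

for start, clump in find_clumps(dslam_alarms, WINDOW, CLUMP_THRESHOLD):
    suspect = any(abs(s - start) <= WINDOW for s in lightning_strikes)
    verdict = "possible lightning cause" if suspect else "investigate"
    print(start, [d for _, d in clump], verdict)
```

The single truck-roll to collect artifacts (photos of lightning damage, etc) is then the confirmation step for whatever the correlation suggests.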

That discussion got me wondering about what other lateral approaches are used by operators to assure their networks. For example:

  1. What external data sources do you use (eg meteorology, lightning strike, power feed data from power suppliers or sensors, sensor networks, etc)?
  2. Do you use it in proactive or reactive mode (eg to diagnose a fault or to use engineering techniques to prevent faults)?
  3. Have you built algorithms (eg root-cause, predictive maintenance, etc) to utilise your external data sources?
  4. If so, do those algorithms help establish automated closed-loop detect and response cycles?
  5. By measuring and managing, have you created quantifiable improvements in your network health?

I’d love to hear about your clever and unique insight-generation ideas. Or even the ideas you’ve proposed that haven’t been built yet.

282 million reasons for increased OSS/BSS scrutiny

“The hotel group Marriott International has been told by the UK Information Commissioner’s Office that it will be fined a little over £99 million (A$178 million) over a data breach that occurred in December last year… This is the second fine for data breaches announced by the ICO on successive days. On Monday, it said British Airways would be fined £183.39 million (A$329.1 million) for a data breach that occurred in September 2018.”
Sam Varghese of ITwire.

The scale of the fines issued to Marriott and BA is mind-boggling.

Here’s a link to the GDPR (General Data Protection Regulation) fine regime and determination process. Fines can be issued by GDPR policing agencies of up to €20 million, or 4% of the worldwide annual revenue of the prior financial year, whichever is higher.

Determination is based on the following questions:

  1. Nature of infringement: number of people affected, damage they suffered, duration of infringement, and purpose of processing
  2. Intention: whether the infringement is intentional or negligent
  3. Mitigation: actions taken to mitigate damage to data subjects
  4. Preventative measures: how much technical and organizational preparation the firm had previously implemented to prevent non-compliance
  5. History: (83.2e) past relevant infringements, which may be interpreted to include infringements under the Data Protection Directive and not just the GDPR, and (83.2i) past administrative corrective actions under the GDPR, from warnings to bans on processing and fines
  6. Cooperation: how cooperative the firm has been with the supervisory authority to remedy the infringement
  7. Data type: what types of data the infringement impacts; see special categories of personal data
  8. Notification: whether the infringement was proactively reported to the supervisory authority by the firm itself or a third party
  9. Certification: whether the firm had qualified under approved certifications or adhered to approved codes of conduct
  10. Other: other aggravating or mitigating factors may include financial impact on the firm from the infringement

The two examples listed above provide 282 million reasons for governments to police data protection more stringently than they do today. The regulatory pressure is only going to increase, right? As I understand it, these processes are currently only enforced in reactive mode. What if the regulators move to proactive mode?

Question for you – Looking at #7 above, do you think the customer information stored in your OSS/BSS is more or less “impactful” than that of Marriott or British Airways?

Think about this question in terms of the number of daily interactions you have with hotels and airlines versus telcos / ISPs. I’ve stayed in Marriott hotels for over a year in accumulated days. I’ve boarded hundreds of flights. But I can’t begin to imagine how many of my data points the telcos / ISP could potentially collect every day. It’s in our OSS/BSS data stores where those data points are most likely to end up.

Do you think our OSS/BSS are going to come under increasing GDPR-like scrutiny in coming years? Put it this way, I suspect we’re going to become more familiar with risk management around the 10 dot points above than we have been in the past.

Step-by-step guide to build a systematic root-cause analysis (RCA) pipeline

Fault / Alarm management tools have lots of strings to their functionality bows to help operators focus in on the target/s that matter most. ITU-T’s recommendation X.733 provided an early framework and common model for classification of alarms. This allowed OSS vendors to build a standardised set of filters (eg severity, probable cause, etc). ITU-T’s recommendation M.3703 then provided a set of guiding use cases for managing alarms. These recommendations have been around since the 1990s (or possibly even earlier).

Despite these “noise reduction” tools being readily available, they’re still not “compressing” event lists enough in all cases.

I imagine, like me, you’ve heard many customer stories where so many new events are appearing in an event list each day that the NOC (network operations centre) just can’t keep up. Dozens of new events are appearing on the screen, then scrolling off the bottom of it before an operator has even had a chance to stop and think about a resolution.

So if humans can’t keep up with the volume, we need to empower machines with their faster processing capabilities to do the job. But to do so, we first have to take a step away from the noise and help build a systematic root-cause analysis (RCA) pipeline.

I call it a pipeline because there are generally a lot of RCA rules that are required. There are a few general RCA rules that can be applied “out of the box” on a generic network, but most need to be specifically crafted to each network.

So here’s a step-by-step guide to build your RCA pipeline:

  1. Scope – Identify your initial target / scope. For example, what are you seeking to prioritise:
    1. Event volume reduction to give the NOC breathing space to function better
    2. Identifying the “most important” events (which first requires defining what “most important” means)
    3. Minimising SLA breaches
    4. etc
  2. Gather Data – Gather incident and ticket data. Your OSS is probably already doing this, but you may need to pull data together from various sources (eg alarms/events, performance, tickets, external sources like weather data, etc)
  3. Pattern Identification – Pattern identification and categorisation of incidents. This generally requires a pattern identification tool, ideally supplied by your alarm management and/or analytics supplier
  4. Prioritise – Using a long-tail graph like below, prioritise pattern groups by the following (and in line with item #1 above):
      1. Number of instances of the pattern / group (ie frequency)
      2. Priority of instances (ie urgency of resolution)
      3. Number of linked incidents (ie volume)
      4. Other technique, such as a cumulative/blended metric

  5. Gather Resolution Knowledge – Understand current NOC approaches to fault-identification and triage, as well as what’s important to them (noting that they may have biases such as managing to vanity metrics)
  6. Note any Existing Resolutions – Identify and categorise any existing resolutions and/or RCA rules (if data supports this)
  7. Short-list Remaining Patterns – Overlay resolution patterns on the long-tail (to show which patterns are already solved for), then identify the remaining priority patterns that don’t yet have a resolution
  8. Codify Patterns – Progressively set out to identify possible root-cause by analysing cause-effect such as:
    1. Topology-based
    2. Object hierarchy
    3. Time-based ripple
    4. Geo-based ripple
    5. Other (as helped to be defined by NOC operators)
  9. Knowledge base – Create a knowledge base that itemises root-causes and supporting information
  10. Build Algorithm / Automation – Create an algorithm for identifying root-cause and related alarms. Identify level of complexity, risks, unknowns, likelihood, control/monitoring plan for post-install, etc. Then build pilot algorithm (and possibly roll-back technique??). This might not just be an RCA rule, but could also include other automations. Automations could include creating a common problem and linking all events (not just root cause event but all related events), escalations, triggering automated workflows, etc
  11. Test pilot algorithm (with analytics??)
  12. Introduce algorithm into production use – but continue to monitor what’s being suppressed
  13. Repeat – Then repeat from steps 7 to 12 to codify the next most important pattern
  14. Leading metrics – Identify leading metrics and/or preventative measures that could precede the RCA rule. Establish closed-loop automated resolution
  15. Improve – Manage and maintain process improvement
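To give a feel for steps 3 and 4, here’s a deliberately simplified sketch that reduces each historical incident to its sequence of event types, groups identical sequences into patterns, and ranks them to form the long-tail. Real pattern-identification tools are far more sophisticated; the event names below are invented for illustration.

```python
# Minimal sketch of steps 3-4 only (pattern identification and prioritisation).
from collections import Counter

# Each historical incident reduced to the ordered tuple of event types it contained
incidents = [
    ("LOS", "PORT_DOWN", "BGP_PEER_LOSS"),
    ("LOS", "PORT_DOWN", "BGP_PEER_LOSS"),
    ("POWER_FAIL", "DSLAM_UNREACHABLE"),
    ("LOS", "PORT_DOWN", "BGP_PEER_LOSS"),
    ("FAN_FAIL",),
]

pattern_counts = Counter(incidents)

# Long-tail prioritisation: most frequent patterns first (step 4.1);
# a blended metric (step 4.4) could weight by priority and linked-incident volume too.
for rank, (pattern, count) in enumerate(pattern_counts.most_common(), start=1):
    print(f"{rank}. {' -> '.join(pattern)}  ({count} incidents)")
```

Each high-ranking pattern without an existing resolution (step 7) then becomes a candidate for codification into an RCA rule or automation (steps 8-10).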

What if most OSS/BSS are overkill? Planning a simpler version

You may recall a recent article that provided a discussion around the demarcation between OSS and BSS, which included the following graph:

Note that this mapping is just my interpretation of the demarcation, not a definitive guide. It’s definitely open to differing opinions (ie religious wars).

Many of you will be familiar with the framework that the mapping is overlaid onto – TM Forum’s TAM (The Application Map). Version R17.5.1 in this case. It is as close as we get to a standard mapping of OSS/BSS functionality modules. I find it to be a really useful guide, so today’s article is going to call on the TAM again.

As you would’ve noticed in the diagram above, there are many, many modules that make up the complete OSS/BSS estate. And you should note that the diagram above only includes Level 2 mapping. The TAM recommendation gets a lot more granular than this. This level of granularity can be really important for large, complex telcos.

For the OSS/BSS that support smaller telcos, network providers or utilities, this might be overkill. Similarly, there are OSS/BSS vendors that want to cover all or large parts of the entire estate for these types of customers. But as you’d expect, they don’t want to provide the same depth of functionality coverage that the big telcos might need.

As such, I thought I’d provide the cut-down TAM mapping below for those who want a less complex OSS/BSS suite.

It’s a really subjective mapping because each telco, provider or vendor will have their own perspective on mandatory features or modules. Hopefully it provides a useful starting point for planning a low complexity OSS/BSS.

Then what high-level functionality goes into these building blocks? That’s possibly even more subjective, but here are some hints:

OSS that repair virtualised networks – the dual loop approach

In a recent article, we talked about Network Service Assurance (NSA) in an environment where network virtualisation exists.

One of the benefits of virtualisation or NaaS (Network as a Service) is that it provides a layer of programmability to your network. That is, the ability to instantiate network services in software through a network API. Virtualisation also tends to assume/imply that there is a huge amount of available capacity (the resource pool) between which it can shift workloads. If one virtual service instance dies or deteriorates, then just automatically spin up another. If one route goes down, customer services are automatically re-directed via alternate routes and the service is maintained. No problem…

But there are some problems that can’t be solved in software. You can’t just use software to fix a cable that’s been cut by an excavator. You can’t just use software to fix failed electronics. Modern virtualised networks can do a great job of self-healing, routing around the problem areas. But there are still physical failures that need to be repaired / replaced / maintained by a field workforce. NSA doesn’t tend to cover that.

Looking at the diagram below, NSA does a great job of the closed-loop assurance within the red circle. But it then needs to kick out to the green closed-loop assurance processes that are already driven by our OSS/BSS.

As described in the link above, “Perhaps if the NSA was just assuring the yellow cloud/s, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).”

Therefore, a key part of the NSA process is how it kicks up from closed-loop 1 to closed-loop 2. Then, after closed-loop 2 has repaired the physical problem, NSA needs to be aware that the repaired resource is now back in the pool of available resources. Does your NSA automatically notice this, or must it receive a notification from closed loop 2?

It could be as simple as NSA sending alarms into the alarm list with a clearly articulated root-cause. The alarm has a ticket/s raised against it. The ticket triggers the field workforce to rectify it and also triggers customer assurance teams/tools to send notifications to impacted customers (if indeed they send notifications to customers who may not actually be affected yet due to the resilience measures that have kicked in). Standard OSS/BSS practice!
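If it helps to visualise the hand-off, here’s a highly simplified sketch of the two loops interacting. Every function, queue and identifier here is hypothetical; it only aims to show the kick-up from closed-loop 1 to closed-loop 2 and the notification back once the physical resource is repaired.

```python
# Hedged sketch of the loop-1 to loop-2 hand-off - not a real NSA/OSS API.

physical_faults = []                       # alarms kicked up from NSA (closed loop 1)
resource_pool = {"fibre-17": "failed"}     # NSA's view of the resource pool

def nsa_detects_physical_fault(resource_id, root_cause):
    """Closed loop 1: NSA can't fix this in software, so it raises an alarm."""
    physical_faults.append({"resource": resource_id, "root_cause": root_cause})

def oss_assurance_cycle():
    """Closed loop 2: raise ticket, dispatch field crew, notify customers."""
    while physical_faults:
        fault = physical_faults.pop(0)
        ticket = {"id": "TT-1001", "resource": fault["resource"]}  # hypothetical ticket
        # ... truck roll, repair and customer notifications happen here ...
        close_ticket(ticket)

def close_ticket(ticket):
    """On closure, tell NSA the repaired resource is back in the pool."""
    resource_pool[ticket["resource"]] = "available"

nsa_detects_physical_fault("fibre-17", "cable cut")
oss_assurance_cycle()
print(resource_pool)   # {'fibre-17': 'available'}
```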

OSS change…. but not too much… oh no…..

Let me start today with a question:
Does your future OSS/BSS need to be drastically different to what it is today?

Please leave me a comment below, answering yes or no.

I’m going to take a guess that most OSS/BSS experts will answer yes to this question, that our future OSS/BSS will change significantly. It’s the reason I wrote the OSS Call for Innovation manifesto some time back. As great as our OSS/BSS are, there’s still so much need for improvement.

But big improvement needs big change. And big change is scary, as Tom Nolle points out:
“IT vendors, like most vendors, recognize that too much revolution doesn’t sell. You have to creep up on change, get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening.”

Do you feel like we’re already in the midst of a revolution? Cloud computing, web-scaling and virtualisation (of IT and networks) have been partly responsible for it. Agile and continuous integration/delivery models too.

The following diagram shows a “from the moon” level view of how I approach (almost) any new project.

The key to Tom’s quote above is in step 2. Just how far, or how ambitious, into the future are you projecting your required change? Do you even know what that future will look like? After all, the environment we’re operating within is changing so fast. That’s why Tom is suggesting that for many of us, step 2 is just a “creep up on it change.” The gap is essentially small.

The “creep up on it change” means just adding a few new relatively meaningless features at the end of the long tail of functionality. That’s because we’ve already had the most meaningful functionality in our OSS/BSS for decades (eg customer management, product / catalog management, service management, service activation, network / service health management, inventory / resource management, partner management, workforce management, etc). We’ve had the functionality, but that doesn’t mean we’ve perfected the cost or process efficiency of using it.

So let’s say we look at step 2 with a slightly different mindset. Let’s say we don’t try to add any new functionality. We lock that down to what we already have. Instead we do re-factoring and try to pull the efficiency levers, which means changes to:

  1. Platforms (eg cloud computing, web-scaling and virtualisation as well as associated management applications)
  2. Methodologies (eg Agile, DevOps, CI/CD, noting of course that they’re more than just methodologies, but also come with tools, etc)
  3. Process (eg User Experience / User Interfaces [UX/UI], supply chain, business process re-invention, machine-led automations, etc)

It’s harder for most people to visualise what the Step 2 Future State looks like. And if it’s harder to envisage Step 2, how do we then move onto Steps 3 and 4 with confidence?

This is the challenge for OSS/BSS vendors, suppliers, integrators and implementers. How do we “get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening?” And I should point out that it’s not just buyers we need to get disconnected from the comfortable past, but ourselves, myself definitely included.

Network Service Assurance has new meaning

Back in the old days, Network Service Assurance probably had a different meaning than it might today.

Clearly it’s assurance of a network service. That’s fairly obvious. But it’s in the definition of “network service” where the old and new terminologies have the potential to diverge.

In years past, telco networks were “nailed up” and network functions were physical appliances. I would’ve implied (probably incorrectly, but bear with me) that a “network service” was “owned” by the carrier and was something like a bearer circuit (as distinct from a customer service or customer circuit). Those bearer circuits, using protocols such as DWDM, SDH, SONET, ATM, etc, potentially carried lots of customer circuits, so they were definitely worth assuring. And in those nailed-up networks, we knew exactly which network appliances / resources / bearers were being utilised. This simplified service impact analysis (SIA) and allowed targeted fault-fix.

In those networks the OSS/BSS was generally able to establish a clear line of association from customer service to physical resources as per the TMN pyramid below. Yes, some abstraction happened as information permeated up the stack, but awareness of connectivity and resource utilisation was generally retained end-to-end (E2E).
OSS abstract and connect

But in the more modern computer or virtualised network, it all goes a bit haywire, perhaps starting right back at the definition of a network service.

The modern “network service” is more aligned to ETSI’s NFV definition – “a composition of network functions and defined by its functional and behavioral specification. The Network Service contributes to the behaviour of the higher layer service, which is characterised by at least performance, dependability, and security specifications. The end-to-end network service behaviour is the result of a combination of the individual network function behaviours as well as the behaviours of the network infrastructure composition mechanism.”

They are applications running at OSI’s application layer that can be consumed by other applications. These network services include DNS, DHCP, VoIP, etc, but the concept of NaaS (Network as a Service) expands the possibilities further.

So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the physical layer, other than to say the customer services consume from a pool of resources (the yellow cloud below). Assurance becomes more disconnected as a result.

BSS OSS cloud abstract

OSS/BSS are able to tie customer services to pools of resources (the yellow cloud). And OSS/BSS tools also include PNI / WFM (Physical Network Inventory / Workforce Management) to manage the bottom, physical layer. But now there’s potentially an opaque gulf in the middle where virtualisation / NaaS exists.

The end-to-end association between customer services and the physical resources that carry them is lost. Unless we can find a way to establish E2E association, we just have to hope that our modern Network Service Assurance (NSA) tools make the yellow cloud robust to the point of infallibility. BTW. If the yellow cloud includes NaaS, then the NSA has to assure the NaaS gateway, catalog and all services instantiated through the gateway.

But as we know, there will always be failures in physical infrastructure (cable cuts, electronic malfunctions, etc). The individual resources can’t afford to be infallible, even if the resource pool seeks to provide collective resiliency.

Modern NSA has to find a way to manage the resource pool but also coordinate fault-fix in the physical resources that underpin it, like our OSS used to do (still do??). They have to do more than just build policies and actions to ensure SLAs are met, don’t they? They can seek to manage security, power, performance, utilisation and more. Unfortunately, not everything can be fixed programmatically, although that is a great place for NSA to start.

Perhaps if the NSA was just assuring the yellow cloud, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).

I’d love to hear your thoughts. Hopefully you can even correct me if/where I’m wrong.

Auto-releasing chaos monkeys to harden your network (CT/IR)

In earlier posts, we’ve talked about using Netflix’s chaos monkey approach as a way of getting to Zero Touch Assurance (ZTA). The chaos monkeys intentionally trigger faults in the network as a means of ensuring resilience. Not just for known degradation / outage events, but to unknown events too.

I’d like to introduce the concept of CT/IR – Continual Test / Incremental Resilience. Analogous to CI/CD (Continuous Integration / Continuous Delivery) before it, CT/IR is a method to systematically and programmatically test the resilience of the network, then ensure that resilience is continually improving.

The continual, incremental improvement in resiliency potentially comes via multiple feedback loops:

  1. Ideally, the existing resilience mechanisms work around or overcome any degradation or failure in the network
  2. The continual triggering of faults into the network will provide additional seed data for AI/ML tools to learn from and improve upon, especially root-cause analysis (noting that in the case of CT/IR, the root-cause is certain – we KNOW the cause – because we triggered it – rather than reverse engineering what the cause may have been)
  3. We can program the network to overcome the problem (eg turn up extra capacity, re-engineer traffic flows, change configurations, etc). Having the NaaS that we spoke about yesterday provides greater programmability for the network, by the way.
  4. We can implement systematic programs / projects to fix endemic faults or weak spots in the network *
  5. Perform regression tests to constantly stress-test the network as it evolves through network augmentation, new device types, etc

Now, you may argue that no carrier in their right mind will allow intentional faults to be triggered. So that’s where we unleash the chaos monkeys on our digital twin technology and/or PSUP (Production Support) environments at first. Then on our prod network if we develop enough trust in it.
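To make a CT/IR cycle a little more tangible, here’s a conceptual sketch of the loop running against a digital twin. The twin structure, fault types and checks are all invented for illustration; the point is the feedback data each cycle generates for the loops listed above.

```python
# Conceptual sketch of one CT/IR cycle against a digital twin - everything here is hypothetical.
import random

FAULT_TYPES = ["link_down", "node_reboot", "config_drift", "packet_loss"]
results = []   # seed data for the AI/ML feedback loop (item 2 above)

def run_ctir_cycle(twin):
    target = random.choice(twin["elements"])
    fault = random.choice(FAULT_TYPES)
    twin_inject_fault(twin, target, fault)        # we KNOW the root cause - we triggered it
    survived = twin_service_still_up(twin)        # did the resilience mechanisms cope? (item 1)
    results.append({"target": target, "fault": fault, "survived": survived})
    if not survived:
        raise_remediation_task(target, fault)     # feeds items 3 and 4

def twin_inject_fault(twin, target, fault):
    twin.setdefault("active_faults", []).append((target, fault))

def twin_service_still_up(twin):
    # Placeholder check - a real twin would re-run service-path validation
    return len(twin.get("active_faults", [])) < 2

def raise_remediation_task(target, fault):
    print(f"Endemic weak spot: {target} did not survive {fault}")

twin = {"elements": ["PE-router-1", "OLT-3", "core-link-7"]}
for _ in range(3):
    run_ctir_cycle(twin)
print(results)
```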

I live in Australia, which suffers from severe bushfires every summer. Our fire-fighters spend a lot of time back-burning during the cooler months to reduce flammable material and therefore the severity of summer fires. Occasionally the back-burns get out of control, causing problems. But they’re still done for the greater good. The same principle could apply to unleashing chaos monkeys on a production network… once you’re confident in your ability to control the problems that might follow.

* When I say network, I’m not just referring to the physical and logical network, but also to support functions such as EMS (Element Management Systems), NCM (Network Configuration Management tools), backup/restore mechanisms, service order replay processes in the event of an outage, OSS/BSS, NaaS, etc.

Where does BSS end and OSS begin?

Over the years, I’ve been asked the question many times, “what’s the difference between OSS (Operational Support Systems) and BSS (Business Support Systems)?” I’ve also been asked, albeit slightly less regularly, how OSS and BSS map to TM Forum standards like the TAM and eTOM.

To my knowledge, TM Forum has never attempted to map OSS vs BSS. It sets off too many religious wars.

Just for fun, I thought I’d have a crack at trying to map OSS and BSS onto the TAM. Click on the image for a larger PDF version.

OSS and BSS overlaid onto the TAM

I’ve taken the perspective that customer or business-facing functionality is generally considered to be BSS. Alternatively, network / operations-facing functionality is generally considered to be OSS.
And these two tend to overlap at the service layer.

Or, you could just simply call them business operations systems (BOS) that cover the entire TAM estate.

What do you think? Does it trigger a religious war for you? Comments welcomed below.

FWIW. I come from an era when my “OSS” tools had a lot of functionality that could arguably be classified as BSS-centric (eg product management, customer relationship management, service order entry, etc). They also happened to deliver functionality that others might classify as NMS or EMS (Network Management System or Element Management System) in nature. In my mind, they’ve always just been software that supports operationalisation of a network, whether customer or network/resource-facing. It’s one of the reasons this site is called Passionate About OSS, not Passionate About OSS/BSS/NMS/EMS.

Is your OSS squeaking like an un-oiled bearing?

Network operators spend huge amounts on building and maintaining their OSS/BSS every year. There are many reasons they invest so heavily, but in most cases it can be distilled back to one thing – improving operational efficiency.

And our OSS/BSS definitely do improve operational efficiency, but there are still so many sources of friction. They’re squeaking like un-oiled bearings. Here are just a few of the common sources:

  1. First-time Installation
  2. Identifying best-fit tools
  3. Procurement of new tools
  4. Update / release processes
  5. Continuous data quality / consistency improvement
  6. Navigating to all features through the user interface
  7. Non-intuitive functionality / processes
  8. So many variants / complexity that end-users take years to attain expert-level capability
  9. Integration / interconnect
  10. Getting new starters up to speed
  11. Getting proficient operators to expertise
  12. Unlocking actionable insights from huge data piles
  13. Resolving the root-cause of complex faults
  14. Onboarding new customers
  15. Productionising new functionality
  16. Exception and fallout handling
  17. Access to supplier expertise to resolve challenges

The list goes on far deeper than that list too. The challenge for many OSS product teams, for any number of reasons, is that their focus is on adding new features rather than reducing friction in what already exists.

The challenge for product teams is diagnosing where the friction and risks are for their customers / stakeholders. How do you get that feedback?

  • Every vendor has a product support team, so that’s a useful place to start, both in terms of what’s generating the most support calls and in terms of first-hand feedback from customers
  • Do you hold user forums on a regular basis, where you get many of your customers together to discuss their challenges, your future roadmap and new improvements / features?
  • Does your process “flow” data show where the sticking points are for operators?
  • Do you conduct gemba walks with your customers?
  • Do you have a program of ensuring all developers spend at least a few days a year interacting directly with customers on their site/s?
  • Do you observe areas of difficulty when delivering training?
  • Do you go out of your way to ask your customers / stakeholders questions that are framed around their pain-points, not just framed within the context of your existing OSS?
  • Do you conduct customer surveys? More importantly, do you conduct surveys through an independent third-party?

On the last dot-point, I’ve been surprised at some of the profound insights end-users have shared with me when I’ve been conducting these reviews as the independent interviewer. I’ve tended to find answers are more open / honest when being delivered to an independent third-party than if the supplier asks directly. If you’d like assistance running a third-party review, leave us a note on the contact page. We’d be delighted to assist.

Fast and slow OSS, where uCPE and network virtualisation fits in

Yesterday’s post talked about one of the many dichotomies in OSS – fast and slow data / processes.

One of the longer lead-time items in relation to OSS data and processes is in network build and customer connections. From the time when capacity planning or a customer order creates the signal to build, it can be many weeks or months before the physical infrastructure work is complete and appearing in the OSS.

There are two financial downsides to this. Firstly, it tends to be CAPEX-heavy with equipment, construction, truck-rolls, government approvals, etc burning through money. Meanwhile, it’s also a period where there is no money coming in because the services aren’t turned on yet. The time-to-cash cycle of new build (or augmentation) is the bane of all telcos.

This is one of the exciting aspects of network virtualisation for telcos. In a time where connectivity is nearly ubiquitous in most countries, often with high-speed broadband access, physical build becomes less essential (except over-builds). Technologies such as uCPE (Universal Customer Premises Equipment), NFV (Network Function Virtualisation), SD WAN (Software-Defined Wide Area Networks), SDN (Software Defined Networks) and others mean that we can remotely upgrade and reconfigure the network without field work.

Network virtualisation gives us the potential to speed up many of the slowest and costliest processes that run through our OSS… but only if our OSS can support efficient orchestration of virtualised networks. And that means having an OSS with the flexibility to easily change out slow processes and replace them with fast ones without massive overhauls.

Give me a fast OSS and I might ask you to slooooow doooown

The traditional telco (and OSS) ran at different speeds. Some tasks had to happen immediately (eg customers calling one another) while others took time (eg getting a connection to a customer’s home, which included designs, approvals, builds, etc), often weeks.

Our OSS have processes that must happen sequentially and expediently. They also have processes that must wait for dependencies, conditional events and time delays. Some roles need “fast,” others can cope with “slow.” Who wins out in this dilemma?

Even the data we rely on can transact at different speeds. For capacity planning, we’re generally interested in longer-term data. We don’t have to process it in real-time. Therefore we can choose to batch process at longer cycle times and with summarised data sets. For network assurance, we’re generally interested in getting data as quickly as is viable.

Today’s post is about that word, viable, and the pragmatism we sometimes have to apply to our OSS.

For example, if our operations teams want to reduce network performance poll cycles from every 15 mins down to once a minute, we increase the amount of data to process by 15x. That means our data storage costs go up by 15x (assuming a flat-rate cost structure applies). The other hidden cost is that our compute and network costs also go up because we have to transfer and process 15x as much data.

The trade-off we have to make in response to this rapid escalation of cost (when going from 15 min to 1 min) is in the benefits we might derive. Can we avoid SLA (Service Level Agreement) breach costs? Can we avoid costly outages? Can we avoid damage to equipment? Can we reduce the risk of losing our carrier license?

The other question is whether our operators actually have the ability to respond to 15x as much data. Do we have enough people to respond at an increased cycle time? Do we have OSS tools that are capable of filtering what’s important and disregarding “background” activity? Do we have OSS tools that are capable of learning from every single metric (eg AI), at volumes the human brain could never cope with?

Does it make sense that we have a single platform for handling fast and slow processes? For example, do we use the same platform to process 1 minute-cycle performance data for long-term planning (batch-processed once daily) and quick-fire assurance (processed as fast as possible)?

If we stick to one platform, can our OSS apply data reduction techniques (eg selective discard of records) to get the benefits of speed, but with the cost reduction of slow?
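As one possible answer (a sketch of an idea, not a recommendation), a single platform could apply tiered retention: keep the raw 1-minute samples only for a short assurance window, and roll everything up to 15-minute averages for long-term capacity planning. The port, time range and retention choices below are purely illustrative.

```python
# Sketch of tiered retention / selective discard - illustrative only.
import pandas as pd

# Hypothetical 1-minute utilisation samples for one port (2 hours' worth)
samples = pd.DataFrame({
    "timestamp": pd.date_range("2019-07-15 00:00", periods=120, freq="1min"),
    "utilisation_pct": range(120),
})

# Fast path: keep only the most recent hour at full 1-minute resolution (assurance)
recent = samples[samples["timestamp"] >= samples["timestamp"].max() - pd.Timedelta(hours=1)]

# Slow path: 15-minute averages retained long-term (1/15th the storage) for planning
rollup = (
    samples.set_index("timestamp")["utilisation_pct"]
           .resample("15min").mean()
)
print(len(samples), len(recent), len(rollup))   # 120 raw, 61 recent, 8 rolled up
```

The same idea extends to event and log data: keep full fidelity where the fast processes need it, and summarise aggressively everywhere else.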

Do you wish more people fell in love with your OSS?

I’d hazard a guess that everyone reading this would admit to being a techie at some level. And being a techie, I’d also imagine that you have blatant tech-love for certain products – gadgets, apps, sites, whatever.

But, let me ask you, are there any OSS products on your love-interest list?

If yes, leave me a comment of “yes” and name of the product below.
If no, leave me a comment of “no” below.

I’m really interested and intrigued to see your answer.

There’s probably only one OSS that I’ve ever had a tech-crush on (but it’s no longer available on the market). It definitely wasn’t love at first sight. If I’m honest, it was probably the opposite. It was a love that took a long time to build. It had some cool modules, but generally it was a bit clunky. The real attraction was that the power and elegance of its data model allowed me to do almost anything with it. To build almost anything with it. To answer almost any business / network / operation question that I could dream up.

I wonder whether the same is true of your other tech-loves? Do they provide the platform for us to create/achieve things that we never dreamed we’d be able to?

If that’s true, I wonder then whether that’s one key to solving the header question?

I wonder whether the other key (the second authentication factor) is in the speed that a user can achieve the necessary level of expertise? Few users ever have the luxury that I had, spending every day for years, to establish the required expertise to make that OSS excel.

As Seth Godin says, “Make things better by making better things.”

PS. If you were kind enough to leave a Yes or No comment below, I’d also love to hear why in an additional comment.