I sometimes wonder whether OPEX is underestimated when considering OSS investments, or at least some facets (sorry, awful pun there!) of it.
Cost-out (aka head-count reduction) seems to be the most prominent OSS business case justification lever. So that’s clearly not underestimated. And the move to cloud is also an OPEX play in most cases, so it’s front of mind during the procurement process too. I’m nought for two so far! Hopefully the next examples are a little more persuasive!
Large transformation projects tend to have a focus on the up-front cost of the project, rightly so. There’s also an awareness of ongoing license costs (usually 20-25% of OSS software list price per annum). Less apparent costs can be found in the exclusions / omissions. This is where third-party OPEX costs (eg database licenses, virtualisation, compute / storage, etc) can be (not) found.
That’s why you should definitely consider preparing a TCO (Total Cost of Ownership) model that includes CAPEX and OPEX that’s normalised across all options when making a buying decision.
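As a rough illustration of that normalisation, here is a minimal sketch of a TCO comparison. The `Option` fields, the two options and every dollar figure are hypothetical, and the model is deliberately undiscounted (no NPV), so treat it as a starting point rather than a finished model.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One candidate solution; every figure here is a hypothetical input."""
    name: str
    capex: float                     # up-front project cost
    list_price: float                # OSS software list price
    licence_rate: float              # annual licence cost as a fraction of list price
    third_party_opex: float = 0.0    # annual database, virtualisation, compute / storage
    customisation_opex: float = 0.0  # annual upkeep of customisations


def annual_opex(option: Option) -> float:
    """Recurring cost per year, including items often left in the exclusions."""
    return (option.list_price * option.licence_rate
            + option.third_party_opex
            + option.customisation_opex)


def tco(option: Option, years: int) -> float:
    """Undiscounted total cost of ownership over the comparison period."""
    return option.capex + years * annual_opex(option)


# Two made-up options, normalised to the same cost categories and time horizon
options = [
    Option("A", capex=1_000_000, list_price=500_000, licence_rate=0.20,
           third_party_opex=50_000),
    Option("B", capex=700_000, list_price=400_000, licence_rate=0.25,
           third_party_opex=120_000, customisation_opex=80_000),
]

for option in sorted(options, key=lambda o: tco(o, years=5)):
    print(f"{option.name}: 5-year TCO = {tco(option, 5):,.0f}")
```

With these invented numbers, option B is cheaper up-front but dearer over five years once the third-party and customisation OPEX are counted, which is exactly the effect the exclusions tend to hide.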
But the more subtle OPEX leakage occurs through customisation. The more customisation from “off-the-shelf” capability, the greater the variation from baseline, the larger the ongoing costs of maintenance and upgrade. This is not just on proprietary / commercial software, but open-source products as well.
And choosing Agile almost implies ongoing customisation. One of the things about Agile is it keeps adding stuff (apps, data, functions, processes, code, etc) via OPEX. It’s stack-ranked, so it’s always the most important stuff (in theory). But because it’s incremental, it tends to be less closely scrutinised than during a CAPEX / procurement event. Unless carefully monitored, there’s a greater chance for OPEX leakage to occur.
And as we know, OPEX costs, like diamonds, are forever (ie the costs re-appear year after year).
In a post last week we posed the question of whether Inventory Management still retains relevance. There are certainly use cases where it remains unquestionably needed. But perhaps there are others where it’s no longer required, a relic of old-school processes and data flows.
If you have an extensive OSP (Outside Plant) network, you have almost no option but to store all this passive infrastructure in an Inventory Management solution. You don’t have the option of having an EMS (Element Management System) console / API to tell you the current design/location/status of the network.
In the modern world of ubiquitous connection and overlay / virtual networks, Inventory Management might be less essential than it once was. For service qualification, provisioning and perhaps even capacity planning, everything you need to know is available on demand from the EMS/s. The network itself is a more accurate record of the network inventory than an external repository (ie Inventory Management) can hope to be, even if you have great success with synchronisation.
But I have a couple of other new-age use-cases to share with you where Inventory Management still retains relevance.
One is for connectivity (okay so this isn’t exactly a new-age use-case, but the scenario I’m about to describe is). If we have a modern overlay / virtual network, anything that stays within a domain is likely to be better served by its EMS equivalent. Especially since connectivity is no longer as simple as physical connections or nearest neighbours with advanced routing protocols. But anything that goes cross-domain and/or off-net needs a mechanism to correlate, coordinate and connect. That’s the role the Inventory Manager is able to play (conceptually).
The other is for digital twinning. OSS (including Inventory Management) was the “original twin.” It was an offline mimic of the production network. But I cite Inventory Management as having a new-age requirement for the digital twin. I increasingly foresee the need for predictive scenarios to be modelled outside the production network (ie in the twin!). We want to try failure / degradation scenarios. We want to optimise our allocation of capital. We want to simulate and optimise customer experience under different network states and loads. We’re beginning to see the compute power that’s able to drive these scenarios (and more) at scale.
Is it possible to handle these without an Inventory Manager (or equivalent)?
“When experts are wrong, it’s often because they’re experts on an earlier version of the world.”
OSS experts are often wrong. Not only because of the “earlier version of the world” paradigm mentioned above, but also the “parallel worlds” paradigm that’s not explicitly mentioned. That is, they may be experts on one organisation’s OSS (possibly from spending years working on it), but have relatively little transferable expertise on other OSS.
It would be nice if the OSS world view never changed and we could just get more and more expert at it, approaching an asymptote of expertise. Alas, it’s never going to be like that. Instead, we experience a world that’s changing across some of our most fundamental building blocks.
“We are the sum total of our experiences.”
My earliest forays into OSS had a heavy focus on inventory. The tie-in between services, logical and physical inventory (and all use-cases around it) was probably core to me becoming passionate about OSS. I might even go as far as saying I’m “an Inventory guy.”
Those early forays occurred when there was a scarcity mindset in network resources. You provisioned what you needed and only expanded capacity within tight CAPEX envelopes. Managing inventory and optimising revenue using these scarce resources was important. We did that with the help of Inventory Management (IM) tools. Even end-users had a mindset of resource scarcity.
But the world has changed. We now operate with a cloud-inspired abundance mindset. We over-provision physical resources so that we can just spin up logical / virtual resources whenever we wish. We have meshed, packet-switched networks rather than nailed up circuits. Generally speaking, cost per resource has fallen dramatically, so each dollar now buys much higher port density, compute capacity, bits carried, etc. Customers of the cloud generation assume an abundance of capacity that is even available in small consumption-based increments. In many parts of the world we can also assume ubiquitous connectivity.
So, as “an inventory guy,” I have to question whether the scarcity to abundance transformation might even fundamentally change my world-view on inventory management. Do I even need an inventory management solution or should I just ask the network for resources when I want to turn on new customers and assume the capacity team has ensured there’s surplus to call upon?
Is the enormous expense we allocate to building and reconciling a digital twin of the network (ie the data gathered and used by Inventory Management) justified? Could we circumvent many of the fallouts (and a multitude of other problems) that occur because the inventory data doesn’t accurately reflect the real network?
For example, in the old days I always loved how much easier it was to provision a customer’s mobile / cellular or IN (Intelligent Network) service than a fixed-line service. It was easier because fixed-line service needed a whole lot more inventory allocation and reservation logic and process. Mobile / IN services didn’t rely on inventory, only an availability of capacity (mostly). Perhaps the day has almost come where all services are that easy to provision?
Yes, we continue to need asset management and capacity planning. Yes, we still need inventory management for physical plant that has no programmatic interface (eg cables, patch-panels, joints, etc). Yes, we still need to carefully control the capacity build-out to CAPEX to revenue balance (even more so now in a lower-profitability operator environment). But do many of the other traditional Inventory Management and resource provisioning use cases go away in a world of abundance?
I’d love to hear your opinions, especially from all you other “inventory guys” (and gals)!! Are your world-views, expertise and experiences changing along these lines too or does the world remain unchanged from your viewing point?
The following is a set of user stories I’ve provided to TM Forum to help with their current Autonomous Networking initiative.
They’re just an initial discussion point for others to riff off. We’d love to get your comments, additions and recommended refinements too.
As a Head of Network Operations, I want to Automatically maintain the health of my network (within expected tolerances if necessary) So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Ensure the overall solution is designed with machine-led automations as a guiding principle So that Human intervention can not be easily engineered into the systems/processes
As a Head of Network Operations, I want to Automatically identify any failures of resources or services within the entire network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Automatically identify any degradation of resource or service performance within the network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Map each codified data set (for failure or degradation cases) to a remedial action plan So that Remedial activities can be initiated without human interaction
As a Head of Network Operations, I want to Identify which remedial activities can be initiated via a programmatic interface and which activities require manual involvement such as a truck roll So that Even manual activities can be automatically initiated
As a Head of Network Operations, I want to Ensure that automations are able to resolve all known failure / degradation scenarios So that Activities can be initiated for any failure or degradation and be automatically resolved through to closure (with little or no human intervention)
As a Head of Network Operations, I want to Ensure there is sufficient network resilience So that Any failure or degradation can be automatically bypassed (temporarily or permanently)
As a Head of Network Operations, I want to Ensure there is sufficient resilience within all support systems So that Any failure or degradation can be automatically bypassed (temporarily or permanently) to ensure customer service is maintained
As a Head of Network Operations, I want to Ensure that operator initiated changes (eg planned maintenance, software upgrades, etc) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure there is minimal or no impact to customer services, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Ensure that customer initiated changes (eg by raising an incident) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure the incident is closed expediently, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Initiate planned outages with or without triggering automated remedial activities So that The change agents can decide to use automations or not and ensure automations don’t adversely affect the activities that are scheduled for the planned outage window
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur, impacted customers are automatically notified (on first instance and via a communications sequence if necessary throughout the outage window) So that Customer experience can be managed as best possible
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur without a remedial action being triggered, a post-mortem analysis is initiated So that Automations can be revised to cope with this previously unhandled outage scenario
As a Head of Network Operations, I want to Ensure that even previously unseen failure scenarios can be handled by remedial automations So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Automatically monitor the effects of remedial actions So that Remedial automations don’t trigger race conditions that result in further degradation and/or downstream impacts
As a Head of Network Operations, I want to Be able to manually override any automations by following a documented sequence of events So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on a scheduled and/or randomised basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on an ad-hoc basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Perform scheduled compliance checks on the network So that Expected configurations and policies are in place across the network
As a Head of Network Operations, I want to Automatically generate scheduled reports relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Automatically generate dashboards (in near-real-time) relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Ensure that automations are able to extend across all domains within the solution So that Remedial actions aren’t constrained by system hand-offs
As a Head of Network Operations, I want to Ensure configuration backups are performed automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be stored as protection in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure configuration restores are performed and tested automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be reverted to in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure automations are able to manage the entire service lifecycle (add, modify/upgrade, suspend, restore, delete) So that Customer services can evolve to meet client expectations with little or no human intervention
As a Head of Network Operations, I want to Have a design and architecture that uses intent-based and/or policy-based actions So that The complexity of automations is minimised (eg automations don’t need to consider custom rules for different device makes/models, etc)
As a Head of Network Operations, I want to Ensure as many components of the solution (eg EMS, OSS, customer portals, etc) have programmatic interfaces (even if manual activities are required in back-end processes) So that Automations can initiate remedial actions in near real time
As a Head of Network Operations, I want to Ensure all components and data flows within the solution are securely hardened (eg encryption of data in motion and at rest) So that The power of the autonomous platform can not be leveraged for nefarious purposes
As a Head of Network Operations, I want to Ensure that all required metrics can be automatically sourced from the network / systems in as near real time as feasible / useful So that Automations have the full set of data they need to initiate remedial actions and it is as up-to-date as possible for precise decision-making
As a Head of Network Operations, I want to Use the power of learning machines So that The sophistication and speed of remedial response is faster, more accurate and more reliable than if manual interaction were used
As a Head of Network Operations, I want to Record actual event patterns and replay scenarios offline So that Event clusters and response patterns can be thoroughly tested as part of the certification process prior to being released into production environments
As a Head of Network Operations, I want to Capture metrics that can be cross-referenced against event patterns and remedial actions So that Regressions and/or refinements can improve existing automations (ie continuous retraining of the model)
As a Head of Network Operations, I want to Be able to seed a knowledge base with relevant event/action data, whether the pattern source is from Production, an offline environment, a digital twin environment or other production-like environments So that The database is able to identify real scenarios, even if scenarios are intentionally initiated, but could potentially cause network degradation to a production environment
As a Head of Network Operations, I want to Ensure that programmatic interfaces also allow for revert / rollback capabilities So that Remedial actions that aren’t beneficial can be rolled back to the previous state; OR other remedial actions are performed, allowing the automation to revert to original configuration / state
As a Head of Network Operations, I want to Be able to initiate circuit breakers to override any automations So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Manually or automatically generate response-plans (ie documented sequences of activities) for any remedial actions fed back into the system So that Internal (eg quality control) or external (eg regulatory) bodies can review “best-practice” remedial activities at any point in time
As a Head of Network Operations, I want to Intentionally trigger catastrophic network failures (in non-prod environments) So that We can trial many remedial actions until we find an optimal solution to seed the knowledge base with
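A few of the stories above (mapping codified events to remedial action plans, separating programmatic fixes from truck rolls, triggering a post-mortem for unhandled scenarios, and the circuit breaker override) can be sketched in a handful of lines. This is only a discussion aid: the event codes, plan names and the `Remediator` class are all hypothetical and don’t come from TM Forum or any product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RemedialPlan:
    name: str
    programmatic: bool  # True: can run via an API; False: needs manual work (eg truck roll)


class Remediator:
    """Maps codified failure / degradation events to remedial action plans."""

    def __init__(self, plans: Dict[str, RemedialPlan]):
        self.plans = plans
        self.breaker_open = False  # circuit breaker: when open, automations are suppressed
        self.log: List[str] = []   # every decision is recorded for later review

    def open_breaker(self) -> None:
        self.breaker_open = True

    def handle(self, event_code: str) -> str:
        if self.breaker_open:
            outcome = "suppressed"                 # a human has overridden the automations
        else:
            plan = self.plans.get(event_code)
            if plan is None:
                outcome = "post-mortem"            # unhandled scenario: revise the automations
            elif plan.programmatic:
                outcome = f"auto:{plan.name}"      # initiated via a programmatic interface
            else:
                outcome = f"dispatch:{plan.name}"  # manual activity, but automatically initiated
        self.log.append(f"{event_code} -> {outcome}")
        return outcome


# Hypothetical event codes and plans
remediator = Remediator({
    "LINK_DOWN": RemedialPlan("reroute_traffic", programmatic=True),
    "FIBRE_CUT": RemedialPlan("truck_roll", programmatic=False),
})
for code in ("LINK_DOWN", "FIBRE_CUT", "NEVER_SEEN_BEFORE"):
    print(remediator.handle(code))
```

Keeping the event-to-plan mapping as data rather than code is one way to let a knowledge base be seeded and revised without touching the automation engine itself.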
Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer to form a service-level perspective.
Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:
We need the concept of a span, as OpenTelemetry has between microservices (in a way, it’s like the nearest neighbour each packet is getting pushed to).
Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods? Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc
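To make the span-chaining idea in the points above a little more concrete, here is a minimal sketch in plain Python. It does not use the OpenTelemetry API; the `Span` fields, node names and metrics are all hypothetical, and it assumes each service follows a single path (no load-balancing across parallel routes) with one already-aggregated record per span.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """One service's hop between two nodes, as reported by a (hypothetical) beacon."""
    service_id: str
    src: str
    dst: str
    latency_ms: float
    packets_in: int
    packets_out: int


def chain(spans: List[Span], service_id: str, ingress: str) -> List[Span]:
    """Order one service's spans hop by hop, starting at the ingress node."""
    by_src: Dict[str, Span] = {s.src: s for s in spans if s.service_id == service_id}
    trace: List[Span] = []
    node = ingress
    while node in by_src:
        span = by_src.pop(node)  # pop so a routing loop cannot spin forever
        trace.append(span)
        node = span.dst
    return trace


def summarise(trace: List[Span]) -> dict:
    """Roll per-span metrics up into an end-to-end view of the service."""
    delivery_ratio = 1.0
    for s in trace:
        delivery_ratio *= (s.packets_out / s.packets_in) if s.packets_in else 0.0
    path = [trace[0].src] + [s.dst for s in trace] if trace else []
    return {
        "path": path,
        "latency_ms": sum(s.latency_ms for s in trace),
        "delivery_ratio": delivery_ratio,
    }


# Beacon records arrive unordered and interleaved across services
spans = [
    Span("svc-1", "B", "C", latency_ms=3.5, packets_in=1000, packets_out=990),
    Span("svc-2", "A", "D", latency_ms=1.0, packets_in=500, packets_out=500),
    Span("svc-1", "A", "B", latency_ms=2.0, packets_in=1000, packets_out=1000),
]
print(summarise(chain(spans, "svc-1", ingress="A")))
```

The hard questions raised above (aggregation window, sampling rate, transport of the beacon data) all sit upstream of this sketch; it only shows what becomes possible once per-service span records reach a central point.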
As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!
I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because just as you have the concept of traces / metrics / logs for apps, you similarly have traces / metrics / logs for networks.
In the network world, we’re good at getting metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains) as described earlier in this blog series. And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering logs / events, which seems a bit backward to me.
Take a closer look at the OpenTelemetry link above, which provides an overview of how their team is going to gather application telemetry. With increasing software-ification of our networks (eg SDN / NFV) and the use of microservices / NaaS / APIs in our management stacks, could this actually be our path to the holy grail of service assurance (ie capturing trace data – network service telemetry)?? Is it data plane? Is it control / management plane? Is it something in between?
Note: The “active measurements” approach described in part 3 is slightly compromised in current form, which is why I’m so intrigued by the potential of extending the concepts of OpenTelemetry into our software / virtual networks.
I’d really love your take on this one because I’m sure there are many elements to this that I haven’t thought through yet. Please leave your thoughts on the viability of the approach.
“Whatever is well conceived is clearly said,
And the words to say it flow with ease.”
I’d like to hijack this quote and re-direct it towards architectures. Could we equally state that a well conceived architecture can be clearly understood? Some modern OSS/IT frameworks that I’ve seen recently are hugely complex. The question I’ve had to ponder is whether they’re necessarily complex. As the aphorism states, “Everything should be made as simple as possible, but not simpler.”
Just take in the complexity of this triptych I prepared to overlay SDN, NFV and MANO frameworks.
Yet this is only a basic model. It doesn’t consider networks with a blend of PNF and VNF (Physical and Virtual Network Functions). It doesn’t consider closed loop assurance. It doesn’t consider other automations, or omni-channel, or etc, etc.
Yesterday’s post raised an interesting concept from Tom Nolle that as our solutions become more complex, our ability to make a basic assessment of value becomes more strained. And by implication, we often need to upskill a team before even being able to assess the value of a proposed project.
It seems to me that we need simpler architectures to be able to generate persuasive business cases. But it poses the question, do they need to be complex or are our solutions just not well enough conceived yet?
To borrow a story from Wikiquote, “Richard Feynman, the late Nobel Laureate in physics, was once asked by a Caltech faculty member to explain why spin one-half particles obey Fermi Dirac statistics. Rising to the challenge, he said, ‘I’ll prepare a freshman lecture on it.’ But a few days later he told the faculty member, ‘You know, I couldn’t do it. I couldn’t reduce it to the freshman level. That means we really don’t understand it.’”
“…as technology gets more complicated, it becomes more difficult for buyers to acquire the skills needed to make even a basic assessment of value. Without such an assessment, it’s hard to get a project going, and in particular hard to get one going the right way.”
Have you noticed that over the last few years, OSS choice has proliferated, making project assessment more challenging? Previously, the COTS (Commercial Off-the-Shelf) product solution dominated. That was already a challenge because there are hundreds to choose from (there are around 400 on our vendors page alone). But that’s just the tip of the iceberg.
We now also have choices to make across factors such as:
Building OSS tools with open-source projects
An increasing amount of in-house development (as opposed to COTS implementations by the product’s vendors)
Smaller niche products that need additional integration
An increase in the number of “standards” that are seeking to solve traditional OSS/BSS problems (eg ONAP, ETSI’s ZSM, TM Forum’s ODA, etc, etc)
Revolutions from the IT world such as cloud, containerisation, virtualisation, etc
As Tom indicates in the quote above, the diversity of skills required to make these decisions is broadening. Broadening to the point where you generally need a large team to have suitable skills coverage to make even a basic assessment of value.
At Passionate About OSS, we’re seeking to address this in the following ways:
We have two development projects underway (more news to come)
One to simplify the vendor / product selection process
One to assist with up-skilling on open-source and IT tools to build modern OSS
In addition to existing pages / blogs, we’re assembling more content about “standards” evolution, which should appear on this blog in coming days
Use our “Finding an Expert” tool to match experts to requirements
And of course there are the variety of consultancy services we offer ranging from strategy, roadmap, project business case and vendor selection through to resource identification and implementation. Leave us a message on our contact page if you’d like to discuss more
One of the benefits of virtualisation or NaaS (Network as a Service) is that it provides a layer of programmability to your network. That is, to be able to instantiate network services by software through a network API. Virtualisation also tends to assume/imply that there is a huge amount of available capacity (the resource pool) that it can shift workloads between. If one virtual service instance dies or deteriorates, then just automatically spin up another. If one route goes down, customer services are automatically re-directed via alternate routes and the service is maintained. No problem…
But there are some problems that can’t be solved in software. You can’t just use software to fix a cable that’s been cut by an excavator. You can’t just use software to fix failed electronics. Modern virtualised networks can do a great job of self-healing, routing around the problem areas. But there are still physical failures that need to be repaired / replaced / maintained by a field workforce. Network Service Assurance (NSA) doesn’t tend to cover that.
Looking at the diagram below, NSA does a great job of the closed-loop assurance within the red circle. But it then needs to kick out to the green closed-loop assurance processes that are already driven by our OSS/BSS.
As described in the link above, “Perhaps if the NSA was just assuring the yellow cloud/s, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).”
Therefore, a key part of the NSA process is how it kicks up from closed-loop 1 to closed-loop 2. Then, after closed-loop 2 has repaired the physical problem, NSA needs to be aware that the repaired resource is now back in the pool of available resources. Does your NSA automatically notice this, or must it receive a notification from closed loop 2?
It could be as simple as NSA sending alarms into the alarm list with a clearly articulated root cause. The alarm has a ticket/s raised against it. The ticket triggers the field workforce to rectify it and also triggers customer assurance teams/tools to send notifications to impacted customers (if indeed they send notifications to customers who may not actually be affected yet due to the resilience measures that have kicked in). Standard OSS/BSS practice!
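The hand-off between the two closed loops could be sketched as follows. This is just one possible shape, assuming NSA is explicitly notified when closed-loop 2 finishes a repair (rather than noticing it by itself); the class names, ticket fields and resource names are all hypothetical.

```python
class ResourcePool:
    """The pool of resources that NSA can shift workloads between."""

    def __init__(self, resources):
        self.available = set(resources)
        self.out_of_service = set()


class CustomerServiceAssurance:
    """Closed loop 2: the existing OSS/BSS stack (alarms, tickets, field workforce)."""

    def __init__(self):
        self.tickets = []

    def raise_ticket(self, resource, root_cause):
        ticket = {"id": len(self.tickets) + 1, "resource": resource,
                  "root_cause": root_cause, "status": "open"}
        self.tickets.append(ticket)
        return ticket

    def close_ticket(self, ticket, nsa):
        ticket["status"] = "closed"
        nsa.on_repaired(ticket["resource"])  # explicit notification back to closed loop 1


class NetworkServiceAssurance:
    """Closed loop 1: self-healing within the virtualised network."""

    def __init__(self, pool, csa):
        self.pool = pool
        self.csa = csa

    def on_fault(self, resource, physical, root_cause):
        if not physical:
            # Stand-in for "just spin up another": handled in software, nothing to escalate
            return None
        # Physical faults can't be fixed in software: withdraw the resource and kick up
        self.pool.available.discard(resource)
        self.pool.out_of_service.add(resource)
        return self.csa.raise_ticket(resource, root_cause)

    def on_repaired(self, resource):
        self.pool.out_of_service.discard(resource)
        self.pool.available.add(resource)


pool = ResourcePool({"vnf-1", "fibre-1"})
csa = CustomerServiceAssurance()
nsa = NetworkServiceAssurance(pool, csa)

ticket = nsa.on_fault("fibre-1", physical=True, root_cause="cable cut by excavator")
print(sorted(pool.available), ticket["status"])
csa.close_ticket(ticket, nsa)  # field workforce has repaired the cable
print(sorted(pool.available), ticket["status"])
```

Customer notifications are left out here; in practice they would hang off the ticket in closed-loop 2, as described above.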
Let me start today with a question: Does your future OSS/BSS need to be drastically different to what it is today?
Please leave me a comment below, answering yes or no.
I’m going to take a guess that most OSS/BSS experts will answer yes to this question, that our future OSS/BSS will change significantly. It’s the reason I wrote the OSS Call for Innovation manifesto some time back. As great as our OSS/BSS are, there’s still so much need for improvement.
But big improvement needs big change. And big change is scary, as Tom Nolle points out:
“IT vendors, like most vendors, recognize that too much revolution doesn’t sell. You have to creep up on change, get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening.”
Do you feel like we’re already in the midst of a revolution? Cloud computing, web-scaling and virtualisation (of IT and networks) have been partly responsible for it. Agile and continuous integration/delivery models too.
The following diagram shows a “from the moon” level view of how I approach (almost) any new project.
The key to Tom’s quote above is in step 2. Just how far, or how ambitious, into the future are you projecting your required change? Do you even know what that future will look like? After all, the environment we’re operating within is changing so fast. That’s why Tom is suggesting that for many of us, step 2 is just a “creep up on it change.” The gap is essentially small.
The “creep up on it change” means just adding a few new relatively meaningless features at the end of the long tail of functionality. That’s because we’ve already had the most meaningful functionality in our OSS/BSS for decades (eg customer management, product / catalog management, service management, service activation, network / service health management, inventory / resource management, partner management, workforce management, etc). We’ve had the functionality, but that doesn’t mean we’ve perfected the cost or process efficiency of using it.
So let’s say we look at step 2 with a slightly different mindset. Let’s say we don’t try to add any new functionality. We lock that down to what we already have. Instead we do re-factoring and try to pull the efficiency levers, which means changes to:
Platforms (eg cloud computing, web-scaling and virtualisation as well as associated management applications)
Methodologies (eg Agile, DevOps, CI/CD, noting of course that they’re more than just methodologies, but also come with tools, etc)
Process (eg User Experience / User Interfaces [UX/UI], supply chain, business process re-invention, machine-led automations, etc)
It’s harder for most people to visualise what the Step 2 Future State looks like. And if it’s harder to envisage Step 2, how do we then move onto Steps 3 and 4 with confidence?
This is the challenge for OSS/BSS vendors, suppliers, integrators and implementers. How do we, “get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening?” And I should point out that it’s not just buyers we need to get disconnected from the comfortable past, but ourselves, myself definitely included.
Back in the old days, Network Service Assurance probably had a different meaning than it might today.
Clearly it’s assurance of a network service. That’s fairly obvious. But it’s in the definition of “network service” where the old and new terminologies have the potential to diverge.
In years past, telco networks were “nailed up” and network functions were physical appliances. I would’ve implied (probably incorrectly, but bear with me) that a “network service” was “owned” by the carrier and was something like a bearer circuit (as distinct from a customer service or customer circuit). Those bearer circuits, using technologies such as DWDM, SDH, SONET, ATM, etc, potentially carried lots of customer circuits so they were definitely worth assuring. And in those nailed-up networks, we knew exactly which network appliances / resources / bearers were being utilised. This simplified service impact analysis (SIA) and allowed targeted fault-fix.
In those networks the OSS/BSS was generally able to establish a clear line of association from customer service to physical resources as per the TMN pyramid below. Yes, some abstraction happened as information permeated up the stack, but awareness of connectivity and resource utilisation was generally retained end-to-end (E2E).
But in the more modern computer or virtualised network, it all goes a bit haywire, perhaps starting right back at the definition of a network service.
The modern “network service” is more aligned to ETSI’s NFV definition – “a composition of network functions and defined by its functional and behavioral specification. The Network Service contributes to the behaviour of the higher layer service, which is characterised by at least performance, dependability, and security specifications. The end-to-end network service behaviour is the result of a combination of the individual network function behaviours as well as the behaviours of the network infrastructure composition mechanism.”
They are applications running at OSI’s application layer that can be consumed by other applications. These network services include DNS, DHCP, VoIP, etc, but the concept of NaaS (Network as a Service) expands the possibilities further.
So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the physical layer, other than to say the customer services consume from a pool of resources (the yellow cloud below). Assurance becomes more disconnected as a result.
OSS/BSS are able to tie customer services to pools of resources (the yellow cloud). And OSS/BSS tools also include PNI / WFM (Physical Network Inventory / Workforce Management) to manage the bottom, physical layer. But now there’s potentially an opaque gulf in the middle where virtualisation / NaaS exists.
The end-to-end association between customer services and the physical resources that carry them is lost. Unless we can find a way to establish E2E association, we just have to hope that our modern Network Service Assurance (NSA) tools make the yellow cloud robust to the point of infallibility. BTW. If the yellow cloud includes NaaS, then the NSA has to assure the NaaS gateway, catalog and all services instantiated through the gateway.
But as we know, there will always be failures in physical infrastructure (cable cuts, electronic malfunctions, etc). The individual resources can’t be assumed to be infallible, even if the resource pool seeks to provide collective resiliency.
Modern NSA has to find a way to manage the resource pool but also coordinate fault-fix in the physical resources that underpin it, like the OSS used to do (still do??). They have to do more than just build policies and actions to ensure SLAs, don’t they? They can seek to manage security, power, performance, utilisation and more. Unfortunately, not everything can be fixed programmatically, although that is a great place for NSA to start.
Perhaps if the NSA was just assuring the yellow cloud, any time it identifies any physical degradation / failure in the resource pool, it kicks a notification up to the Customer Service Assurance (CSA) tools in the OSS/BSS layers? The OSS/BSS would then coordinate 1) any required customer notifications and 2) any truck rolls or fixes that can’t be achieved programmatically; just like it already does today. The additional benefit of this two-tiered assurance approach is that NSA can handle the NFV / VNF world, whilst not trying to replicate the enormous effort that’s already been invested into the CSA (ie the existing OSS/BSS assurance stack that looks after PNFs, other physical resources and the field workforce processes that look after it all).
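To make the two-tiered idea a little more concrete, here’s a minimal sketch in Python of how faults might be triaged between NSA and CSA. It’s only an illustration of the split described above. The class names, the action strings and the assumption that customer impact is already known at triage time are all mine, not taken from any real NSA or OSS product.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    VIRTUAL = "virtual"    # VNFs and other resources inside the yellow cloud
    PHYSICAL = "physical"  # PNFs, cables, cards, power, etc


@dataclass(frozen=True)
class Fault:
    resource_id: str
    layer: Layer
    auto_fixable: bool        # can NSA fix it programmatically?
    customer_impacting: bool  # assumed to be known at triage time


def triage(fault: Fault) -> list:
    """Return the ordered actions for a fault (action names are illustrative)."""
    actions = []
    if fault.auto_fixable:
        actions.append("NSA: remediate programmatically")
    # Physical faults, and anything NSA can't fix itself, go up to CSA.
    if fault.layer is Layer.PHYSICAL or not fault.auto_fixable:
        actions.append("NSA: notify CSA")
        if fault.customer_impacting:
            actions.append("CSA: notify affected customers")
        if not fault.auto_fixable:
            actions.append("CSA: raise work order")
    return actions


print(triage(Fault("vnf-12", Layer.VIRTUAL, True, False)))
print(triage(Fault("card-7", Layer.PHYSICAL, False, True)))
```

The point of the sketch is that the NSA never needs to know how to dispatch a truck roll. It only needs to know when to hand over to the assurance stack that already does.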
I’d love to hear your thoughts. Hopefully you can even correct me if/where I’m wrong.
As the title suggests above, NaaS has the potential to be as big a paradigm shift for networks (and OSS/BSS) as Agile has been for software development.
There are many facets to the Agile story, but for me one of the most important aspects is that it has taken end-to-end (E2E), monolithic thinking and has modularised it. Agile has broken software down into pieces that can be worked on by smaller, more autonomous teams than the methods used prior to it.
The same monolithic, E2E approach pervades the network space currently. If a network operator wants to add a new network type or a new product type/bundle, large project teams must be stood up. And these project teams must tackle E2E complexity, especially across an IT stack that is already a spaghetti of interactions.
But before I dive into the merits of NaaS, let me take you back a few steps, back into the past. Actually, for many operators, it’s not the past, but the current-day model.
As per the orange arrow, customers of all types (Retail, Enterprise and Wholesale) interact with their network operator through BSS (and possibly OSS) tools. [As an aside, see this recent post for a “religious war” discussion on where BSS ends and OSS begins]. The customer engagement occurs (sometimes directly, sometimes indirectly) via BSS tools such as:
Order Entry, Order Management
Product Catalog (Product / Offer Management)
SLA (Service Level Agreement) Management
If the customer wants a new instance of an existing service, then all’s good with the current paradigm. Where things become more challenging is when significant changes occur (as reflected by the yellow arrows in the diagram above).
For example, if any of the following are introduced, there are end-to-end impacts. They necessitate E2E changes to the IT spaghetti and require formation of a project team that includes multiple business units (eg products, marketing, IT, networks, change management to support all the workers impacted by system/process change, etc):
A new product or product bundle is to be taken to market
An end-customer needs a custom offering (especially in the case of managed service offerings for large corporate / government customers)
A new network type is added into the network
System and / or process transformations occur in the IT stack
If we just narrow in on point 3 above, fundamental changes are happening in network technology stacks already. Network virtualisation (SDN/NFV) and 5G are currently generating large investments of time and money. They’re fundamental changes because they also change the shape of our traditional OSS/BSS/IT stacks, as follows.
We now not only have Physical Network Functions (PNF) to manage, but Virtual Network Functions (VNF) as well. In fact it now becomes even more difficult because our IT stacks need to handle PNF and VNF concurrently. Each has their own nuances in terms of over-arching management.
The virtualisation of networks and application infrastructure means that our OSS see greater southbound abstraction. Greater southbound abstraction means we potentially lose E2E visibility of physical infrastructure. Yet we still need to manage E2E change to IT stacks for new products, network types, etc.
The diagram below shows how NaaS changes the paradigm. It de-couples the network service offerings from the network itself. Customer Facing Services (CFS) [as presented by BSS/OSS/NaaS] are de-coupled from Resource Facing Services (RFS) [as presented by the network / domains].
NaaS becomes a “meet-in-the-middle” tool. It effectively de-couples
The products / marketing teams (who generate customer offerings / bundles) from
The networks / operations teams (who design, build and maintain the network), and
The IT teams (who design, build and maintain the IT stack)
It allows product teams to be highly creative with their CFS offerings from the available RFS building blocks. Consider it like Lego. The network / ops teams create the building blocks and the products / marketing teams have huge scope for innovation. The products / marketing teams rarely need to ask for custom building blocks to be made.
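As a rough illustration of the Lego idea, the sketch below shows a product team composing a CFS purely from RFS building blocks that the network / ops teams have already published. The RFS names, the domains and the `compose_cfs` helper are hypothetical and only there to show the de-coupling. A real NaaS catalog would be far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RFS:
    name: str
    domain: str  # the network domain that owns this building block


# Building blocks published by the network / ops teams (hypothetical examples)
RFS_CATALOG = {
    rfs.name: rfs
    for rfs in [
        RFS("ethernet_access", "access"),
        RFS("ip_vpn", "core"),
        RFS("internet_breakout", "core"),
        RFS("virtual_firewall", "nfv"),
    ]
}


def compose_cfs(name, rfs_names):
    """Build a customer-facing service from existing resource-facing services."""
    missing = [n for n in rfs_names if n not in RFS_CATALOG]
    if missing:
        # Only at this point would the product team need to ask for new blocks.
        raise KeyError(f"custom building blocks needed: {missing}")
    return {"cfs": name, "rfs": [RFS_CATALOG[n] for n in rfs_names]}


def domains_touched(cfs):
    """List the network domains a CFS draws on, in alphabetical order."""
    return sorted({rfs.domain for rfs in cfs["rfs"]})


secure_branch = compose_cfs(
    "secure_branch", ["ethernet_access", "ip_vpn", "virtual_firewall"]
)
print(domains_touched(secure_branch))
```

The product team here never touches a network domain directly. It only picks names from the catalog.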
You’ll notice that the entire stack shown in the diagram below is far more modular than the diagram above. Being modular makes the network stack more suited to being worked on by smaller autonomous teams. The yellow arrows indicate that modularity, both in terms of the IT stack and in terms of the teams that need to be stood up to make changes. Hence my claim that NaaS is to networks what Agile has been to software.
You will have also noted that NaaS allows the Network / Resource part of this stack to be broken into entirely separate network domains. Separation in terms of IT stacks, management and autonomy. It also allows new domains to be stood up independently, which accommodates the newer virtualised network domains (and their VNFs) as well as platforms such as ONAP.
The NaaS layer comprises:
A TMF standards-based API Gateway
A Master Services Catalog
A common / consistent framework of presentation of all domains
The ramifications of this excite me even more than what’s shown in the diagram above. By offering access to the network via APIs and as a catalog of services, it allows a large developer pool to provide innovative offerings to end customers (as shown in the green box below). It opens up the long tail of innovation that we discussed last week.
Some telcos will open up their NaaS to internal or partner developers. Others are drooling at the prospect of offering network APIs for consumption by the market.
You’ve probably already identified this, but the awesome thing for the developer community is that they can combine services/APIs not just from the telcos but any other third-party providers (eg Netflix, Amazon, Facebook, etc, etc, etc). I could’ve shown these as East-West services in the diagram but decided to keep it simpler.
Developers are not constrained to offering communications services. They can now create / offer higher-order services that also happen to have communications requirements.
If you weren’t already on board with the concept, hopefully this article has convinced you that NaaS will be to networks what Agile has been to software.
Agree or disagree? Leave me a comment below.
PS1. I’ve used the old TMN pyramid as the basis of the diagram to tie the discussion to legacy solutions, not to imply size or emphasis of any of the layers.
PS3. Similarly, the size of the NaaS layer is to bring attention to it rather than to imply it is a monolithic stack in its own right. In reality, it is actually a much thinner shim layer architecturally.
PS4. The analogy between NaaS and Agile is to show similarities, not to imply that NaaS replaces Agile. They can definitely be used together.
PS5. I’ve used the term IT quite generically (operationally and technically) just to keep the diagram and discussion as simple as possible. In reality, there are many sub-functions like data centre operations, application monitoring, application control, applications development, product owner, etc. These are split differently at each operator.
One of the longer lead-time items in relation to OSS data and processes is in network build and customer connections. From the time when capacity planning or a customer order creates the signal to build, it can be many weeks or months before the physical infrastructure work is complete and appearing in the OSS.
There are two financial downsides to this. Firstly, it tends to be CAPEX-heavy with equipment, construction, truck-rolls, government approvals, etc burning through money. Meanwhile, it’s also a period where there is no money coming in because the services aren’t turned on yet. The time-to-cash cycle of new build (or augmentation) is the bane of all telcos.
This is one of the exciting aspects of network virtualisation for telcos. In a time where connectivity is nearly ubiquitous in most countries, often with high-speed broadband access, physical build becomes less essential (except over-builds). Technologies such as uCPE (Universal Customer Premises Equipment), NFV (Network Function Virtualisation), SD WAN (Software-Defined Wide Area Networks), SDN (Software Defined Networks) and others mean that we can remotely upgrade and reconfigure the network without field work.
Network virtualisation gives the potential to speed up many of the slowest, and costliest processes that run through our OSS… but only if our OSS can support efficient orchestration of virtualised networks. And that means having an OSS with the flexibility to easily change out slow processes to replace them with fast ones without massive overhauls.
One popular approach is to build a proof-of-concept or sandpit quickly on cloud hosting or in lab environments. It’s fast for a number of reasons including reduced number of approvals, faster activation of infrastructure, reduced safety checks (eg security, privacy, etc), minimised integration with legacy systems and many other reasons. The cloud hosting business model is thriving for all of these reasons.
However, it’s one thing to speed up development of an OSS PoC and another entirely to speed up deployment to a PROD environment. As soon as you wish to absorb the PoC-proven solution back into PROD, all the items listed above (eg security sign-offs) come back into play. Something that took days/weeks to stand up in PoC now takes months to productionise.
Have you noticed that the safety checks currently being used were often defined for the old world? They often aren’t designed with transition from cloud to PROD in mind. Similarly, the culture of design cross-checks and approvals can also be re-framed (especially when the end-to-end solution crosses multiple different business units). Lastly, and way outside my locus of competence, there’s the option of re-visiting security / privacy / deployment / etc models to facilitate easier transition.
One consideration to make is just how much absorption is required. For example, there are cases of services being delivered to the large entity’s subscribers by a smaller, external entity. The large entity then just “clips-the-ticket,” gaining a revenue stream with limited involvement. But the more common (and much more challenging) absorption model is for the partner to fold the solution back into the large entity’s full OSS/BSS stack.
So let’s consider your opportunity in terms of the absorption continuum, which ranges between not absorbed (the external entity delivers the service and the large entity just clips the ticket) and fully absorbed (the solution is folded back into the large entity’s full OSS/BSS stack).
Perhaps it’s feasible for your opportunity to fit somewhere in between (partially absorbed)? Perhaps part of that answer resides in the cloud model you decide to use (public, private, hybrid, cloud-managed private cloud) as well as the partnership model?
Modularity and reduced complexity (eg integrations) are also a factor to consider (as always).
I haven’t seen an ideal response to the absorption challenge yet, but I believe the solution lies in re-framing corporate culture and technology stacks. We’ll look at that in more detail tomorrow.
How about you? Have you or your organisation managed to speed up your transition from PoC to PROD? What techniques have you found to be successful?
As the TMN diagram below describes, each layer up in the network management stack abstracts but connects (as described in more detail in “What an OSS shouldn’t do”). That is, each higher layer reduces the amount of information/control within a domain that it’s responsible for, but it assumes a broader responsibility for connecting multiple domains together.
There’s just one problem with the diagram. It’s a little dated when we take modern virtualised infrastructure into account.
In the old days, despite what the layers may imply, it was common for an OSS to actually touch every layer of the pyramid to resolve faults. That is, OSS regularly connected to NMS, EMS and even devices (NE) to gather network health data. The services defined at the top of the stack (BSS) could be traced to the exact devices (NE / NEL) via the circuits that traversed them, regardless of the layers of abstraction. It helped for root-cause analysis (RCA) and service impact analysis (SIA).
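For anyone who hasn’t seen it done, that old-world tracing is really just a walk down (or up) a chain of associations. Here’s a minimal sketch, assuming the OSS holds service-to-circuit and circuit-to-device mappings. The identifiers and the two dictionaries are made up for illustration. Real inventory models have many more layers.

```python
# Hypothetical nailed-up inventory: services ride circuits, circuits traverse devices
SERVICE_TO_CIRCUITS = {
    "svc-A": ["cct-1", "cct-2"],
    "svc-B": ["cct-2"],
    "svc-C": ["cct-3"],
}
CIRCUIT_TO_DEVICES = {
    "cct-1": ["ne-1", "ne-2"],
    "cct-2": ["ne-2", "ne-3"],
    "cct-3": ["ne-4"],
}


def devices_for_service(service):
    """Trace down the stack: which devices carry this service?"""
    devices = set()
    for circuit in SERVICE_TO_CIRCUITS.get(service, []):
        devices.update(CIRCUIT_TO_DEVICES.get(circuit, []))
    return sorted(devices)


def impacted_services(failed_device):
    """Trace up the stack (SIA): which services ride over the failed device?"""
    return sorted(
        service
        for service in SERVICE_TO_CIRCUITS
        if failed_device in devices_for_service(service)
    )


print(impacted_services("ne-2"))
print(devices_for_service("svc-B"))
```

Both functions only work because every link in the chain is known. Remove the circuit-to-device mapping and neither SIA nor RCA has anything to walk.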
But with modern networks, the infrastructure is virtualised, load-balanced and since they’re packet-switched, they’re completely circuitless (I’m excluding virtual circuits here by the way). The bottom three layers of the diagram could effectively be replaced with a cloud icon, a cloud that the OSS has little chance of peering into (see yellow cloud in the diagram later in this post).
The concept of virtualisation adds many sub-layers of complexity too by the way, as highlighted in the diagram below.
So now the customer services at the top of the pyramid (BSS / BML) are quite separated from the resources at the bottom, other than to say the services consume from a known pool of resources. Fault resolution becomes more abstracted as a result.
But what’s interesting is that there’s another layer that’s not shown on the typical TMN model above. That is the physical network inventory (PNI) layer. The cables, splices, joints, patch panels, equipment cards, etc that underpin every network. Yes, even virtual networks.
In the old networks the OSS touched every layer, including the missing layer. That functionality was provided by PNI management. Fault resolution also occurred at this layer through tickets of work conducted by the field workforce (Workforce Management – WFM).
In new networks, OSS/BSS tie services to resource pools (the top two layers). They also still manage PNI / WFM (the bottom, physical layer). But then there’s potentially an invisible cloud in the middle. Three distinctly different pieces, probably each managed by a different business unit or operational group.
Just wondering – has your OSS/BSS developed control anxiety issues from losing some of the control that it once had?
The advertisement includes the following text:
“Amazon Web Services (AWS) is leading the next paradigm shift in computing and is looking for a world class candidate to manage an elite portfolio of strategic AWS technology partners focused on the Operation support System (OSS) and Business Support System (BSS) applications within telecommunications segment. Your job will be to use these strategic partners to develop OSS and BSS applications on AWS infrastructure and platform.”
How do you read this advertisement? I have a few different perspectives to pose to you:
I can’t predict AWS’ future success with this initiative, but I’m assuming they’re creating the role because they see a big opportunity that they wish to capture. They have plenty of places they could otherwise invest, so they must believe the opportunity is big (eg the industry of OSS suppliers selling to CSPs is worth multi-billions of dollars and is waiting to be disrupted).
OSS/BSS are typically seen by CSPs as a very expensive (and risky) cost of doing business. I’m certain there’s a business model for any organisation (possibly AWS and its tech partners) that can significantly improve the OSS/BSS delivery costs/risks for CSPs.
The ad identifies CSPs (specifically the term, “major telecom infrastructure providers”) as the target customer. You could pose the concept that the CSPs won’t want to support a competitor in AWS. The CSPs I’m dealing with can’t get close to matching AWS cost structures so are partnering with AWS etc. Not just for private cloud, but also public and hybrid cloud too. The clip-the-ticket / partnership selling model appears to be becoming more common for telcos globally, so the fear-of-competition barrier “seems” to be coming down a little.
The other big challenge facing the role is network and data security. What’s surprised me most are core network services like directory services (used for internal authentication/AAA purposes). I never thought I’d see these outsourced to third-party cloud providers, but have seen the beginnings of it recently. If CSPs consume those, then OSS/BSS must be up for grabs at some CSPs too. For example, I’d imagine that OSS/BSS tools were amongst the 1,000 business apps that Verizon is moving to AWS.
The really interesting future consideration could be the advanced innovation that AWS et al could bring to the OSS space, and in ways that the telcos and OSS suppliers simply can’t. This recent post showed Google’s intent to bring AI to network operations. It could revolutionise the OSS/BSS industry. Not just for CSPs, but for their customers as well (eg their enterprise-grade OSS). Could it even represent another small step towards the OSS Doomsday Scenario posed here?
This is the fourth, and final part (I think) in the series on killing the OSS RFI/RFP process, a process that suppliers and customers alike find to be inefficient. The concept is based on an initiative currently being investigated by TM Forum.
The previous three posts focused on the importance of trusted partnerships and the methods to develop them via OSS procurement events.
Today’s post takes a slightly different tack. It proposes a structural obsolescence that may lead to the death of the RFP. We might not have to kill it. It might die a natural death.
Actually, let me take that back. I’m sure RFPs won’t die out completely as a procurement technique. But I can see a time when RFPs are far less common and significantly different in nature to today’s procurement events.
That’s the answer all technologists cite to any form of problem of course. But there’s a growing trend that provides a portent to the future here.
It comes via the XaaS (As a Service) model of software delivery. We’re increasingly building and consuming cloud-native services. OSS of the future, the small-grid model, are likely to consume software as services from multiple suppliers.
And rather than having to go through a procurement event like an RFP to form each supplier contract, the small grid model will simply be a case of consuming one/many services via API contracts. The API contract (eg OpenAPI specification / swagger) will be available for the world to see. You either consume it or you don’t. No lengthy contract negotiation phase to be had.
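As a small sketch of what “you either consume it or you don’t” could look like in practice, the snippet below checks whether a published contract offers the operations a consumer needs. The contract is a hand-written dictionary that follows the general shape of an OpenAPI document (a `paths` object keyed by path, then by HTTP method). The service, its paths and the `can_consume` helper are hypothetical.

```python
# HTTP methods that can appear as operation keys under an OpenAPI path item
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# A hypothetical supplier's published contract, trimmed to the parts used here
PUBLISHED_CONTRACT = {
    "openapi": "3.0.3",
    "info": {"title": "Hypothetical alarm-correlation service", "version": "1.0.0"},
    "paths": {
        "/alarms": {"get": {}, "post": {}},
        "/alarms/{id}": {"get": {}, "parameters": []},
    },
}


def offered_operations(contract):
    """Return the set of (METHOD, path) pairs the contract exposes."""
    return {
        (method.upper(), path)
        for path, item in contract["paths"].items()
        for method in item
        if method in HTTP_METHODS  # skip non-operation keys such as "parameters"
    }


def can_consume(contract, required):
    """True if every operation the consumer needs is offered by the contract."""
    return set(required) <= offered_operations(contract)


needs = [("GET", "/alarms"), ("POST", "/alarms")]
print(can_consume(PUBLISHED_CONTRACT, needs))
```

There’s no negotiation step in there. The decision to consume comes down to whether the published contract covers what you need.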
Now as mentioned above, the RFP won’t die, but evolve. We’ll probably see more RFPs formed between customers and the services companies that will create customised OSS solutions (utilising one/many OSS supplier services). And these RFPs may not be with the massive multinational services companies of today, but increasingly through smaller niche service companies. These micro-RFPs represent the future of OSS work, the gig economy, and will surely be facilitated by smart-RFP / smart-contract models (like the OSS Justice League model).
I wonder if we’re reaching the point where “telecommunication services” is no longer a relevant term? By association, SLAs are also a bust. But what are they replaced by?
A telecommunication service used to effectively be the allocation of a carrier’s resources for use by a specific customer. Now? Well, less so:
Service consumption channel alternatives are increasing, from TV and radio; to PC, to mobile, to tablet, to YouTube, to Insta, to Facebook, to a million others.
Consumption sources are even more prolific.
Customer contact channel alternatives are also increasing, from contact centres; to IVR, to online, to mobile apps, to Twitter, etc.
A service bundle often utilises third-party components, some of which are “off-net”
Virtualisation is increasingly abstracting services from specific resources. They’re now loosely coupled with resource pools and rely on high availability / elasticity to ensure customer service continuity. Not only that, but those resource pools might extend beyond the carrier’s direct control and out to cloud provider infrastructure
The growing variant-tree is taking the concept beyond the reach of “customer services” and evolves to become “customer experiences.”
The elements that made up a customer service in the past tended to fall within the locus of control of a telco and its OSS. The modern customer experience extends far beyond the control of any one company or its OSS. An SLA – Service Level Agreement – only pertains to the sub-set of an experience that can be measured by the OSS. We can only aspire to offer an ELA – Experience Level Agreement – because we don’t have the mechanisms by which to measure or manage the entire experience yet.
The metrics that matter most for telcos today tend to revolve around customer experience (eg NPS). But aside from customer surveys, ratings and derived / contrived metrics, we don’t have electronic customer experience measurements.
Customer services are dead; Long live the customer experiences king… if only we can invent a way to measure the whole scope of what makes up customer experiences.
The left-hand panel of the triptych below shows the current state of interactions with most OSS. There are hundreds of variants inbound via external sources (ie multi-channel) and even internal sources (eg different service types). Similarly, there are dozens of networks (and downstream systems), each with different interface models. Each needs different formatting and integration costs escalate.
The intent model of network provisioning standardises the network interface, drastically simplifying the task of the OSS and reducing the number of variants it has to handle. This becomes particularly relevant in a world of NFV, where it doesn’t matter which vendor’s device type (a router, say) is being configured: each can be handled via a single command intent rather than via separate interfaces to each vendor’s device / EMS northbound interface. The unique aspects of each vendor’s implementation are abstracted from the OSS.
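A minimal sketch of that abstraction is shown below. The OSS issues one vendor-neutral intent and a per-vendor adapter translates it. The vendor names, command syntaxes and device inventory are all invented for illustration. They don’t represent any real device CLI or EMS interface.

```python
def vendor_a_adapter(intent):
    """Translate the intent into vendor A's (made-up) CLI syntax."""
    return [
        f"interface {intent['port']} vlan {intent['vlan']} bandwidth {intent['mbps']}m"
    ]


def vendor_b_adapter(intent):
    """Translate the same intent into vendor B's (made-up) syntax, rate in kbps."""
    return [
        f"create-service port={intent['port']} vid={intent['vlan']} "
        f"rate={intent['mbps'] * 1000}kbps"
    ]


# Which adapter handles which vendor, and which vendor built which device
ADAPTERS = {"vendor_a": vendor_a_adapter, "vendor_b": vendor_b_adapter}
INVENTORY = {"router-1": "vendor_a", "router-2": "vendor_b"}


def apply_intent(device, intent):
    """The only call the OSS needs to make, whatever the vendor underneath."""
    return ADAPTERS[INVENTORY[device]](intent)


# One intent, expressed once by the OSS
intent = {"port": "1/1", "vlan": 100, "mbps": 50}
for device in ("router-1", "router-2"):
    print(device, apply_intent(device, intent))
```

Adding a third vendor means writing one more adapter. Nothing changes on the OSS side of `apply_intent`.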
The next step would be in standardising the interface / data model upstream of the OSS. That’s a more challenging task!!
“ONAP provides a comprehensive platform for real-time, policy-driven orchestration and automation of physical and virtual network functions that will enable software, network, IT and cloud providers and developers to rapidly automate new services and support complete lifecycle management.
By unifying member resources, ONAP is accelerating the development of a vibrant ecosystem around a globally shared architecture and implementation for network automation–with an open standards focus–faster than any one product could on its own.”
Part of the ONAP charter from onap.org.
The ONAP project is gaining attention in service provider circles. The Steering Committee of the ONAP project hints at the types of organisations investing in the project. The statement above summarises the mission of this important project. You can bet that the mission has been carefully crafted. As such, one can assume that it represents what these important stakeholders jointly agree to be the future needs of their OSS.
I find it interesting that there are quite a few technical terms (eg policy-driven orchestration) in the mission statement, terms that tend to pre-empt the solution. However, I don’t feel that pre-emptive technical solutions are the real mission, so I’m going to try to reverse-engineer the statement into business needs. Hopefully the business needs (the “why? why? why?” column below) articulate a set of questions / needs that all OSS can work to, as opposed to replicating the technical approach that underpins ONAP.
Why? Why? Why?
The ability to make instantaneous decisions
Why 1: To adapt to changing conditions
Why 2: To take advantage of fleeting opportunities or resolve threats
Why 3: To optimise key business metrics such as financials
Why 4: As CSPs are under increasing pressure from shareholders to deliver on key metrics
To use policies to increase the repeatability of key operational processes
Why 1: Repeatability provides the opportunity to improve efficiency, quality and performance
Why 2: Allows an operator to service more customers at less expense
Why 3: Improves corporate profitability and customer perceptions
Why 4: As CSPs are under increasing pressure from shareholders to deliver on key metrics
To use policies to increase the amount of automation that can be applied to key operational processes
Why 1: Automated processes provide the opportunity to improve efficiency, quality and performance
Why 2: Allows an operator to service more customers at less expense
Why 3: Improves corporate profitability and customer perceptions
physical and virtual network functions
Our networks will continue to consist of physical devices, but we will increasingly introduce virtualised functionality
Why 1: Physical devices will continue to exist into the foreseeable future but virtualisation represents an exciting approach into the future
Why 2: Virtual entities are easier to activate and manage (assuming sufficient capacity exists)
Why 3: Physical equipment supply, build, deploy and test cycles are much longer and labour intensive
Why 4: Virtual assets are more flexible, faster and cheaper to commission
Why 5: Customer services can be turned up faster and cheaper
software, network, IT and cloud providers and developers
With this increase in virtualisation, we find an increasingly large and diverse array of suppliers contributing to our value-chain. These suppliers contribute via software, network equipment, IT functions and cloud resources
Why 1: CSPs can access innovation and efficiency occurring outside their own organisation
Why 2: CSPs can leverage the opportunities those innovations provide
Why 3: CSPs can deliver more attractive offers to customers
Why 4: Key metrics such as profitability and customer satisfaction are enhanced
rapidly automate new services
We want the flexibility to introduce new products and services far faster than we do today
Why 1: CSPs can deliver more attractive offers to customers faster than competitors
Why 2: Key metrics such as market share, profitability and customer satisfaction are enhanced as well as improved cashflow
support complete lifecycle management
The components that make up our value-chain are changing and evolving so quickly that we need to cope with these changes without impacting customers across any of their interactions with their service
Why 1: Customer satisfaction is a key metric and a customer’s experience spans the entire lifecycle of their service.
Why 2: CSPs don’t want customers to churn to competitors
Why 3: Key metrics such as market share, profitability and customer satisfaction are enhanced
unifying member resources
To reduce the amount of duplicated and under-synchronised development currently being done by the member bodies of ONAP
Why 1: Collaboration and sharing reduces the effort each member body must dedicate to their OSS
Why 2: A reduced resource pool is required
Why 3: Costs can be reduced whilst still achieving a required level of outcome from OSS
To increase the level of supplier interchangeability
Why 1: To reduce dependence on any supplier/s
Why 2: To improve competition between suppliers
Why 3: Lower prices, greater choice and greater innovation tend to flourish in competitive environments
Why 4: CSPs, as customers of the suppliers, benefit
globally shared architecture
To make networks, services and support systems easier to interconnect across the global communications network
Why 1: Collaboration on common standards reduces the integration effort between each member at points of interconnect
Why 2: A reduced resource pool is required
Why 3: Costs can be reduced whilst still achieving interconnection benefits
As indicated in earlier posts, ONAP is an exciting initiative for the CSP industry for a number of reasons. My fear for ONAP is that it becomes such a behemoth of technical complexity that it becomes too unwieldy for use by any of the member bodies. I use the analogy of ATM versus Ethernet here, where ONAP is equivalent to ATM in power and complexity. The question is whether there’s an Ethernet answer to the whys that ONAP is trying to solve.
I’d love to hear your thoughts.
(BTW. I’m not saying that the technologies the ONAP team is investigating are the wrong ones. Far from it. I just find it interesting that the mission is starting with a technical direction in mind. I see parallels with the OSS radar analogy.)