Zero-touch network and Service Management (ZSM) is a next-generation network management approach based on closed-loop principles, hosted by ETSI. An ETSI blog post has just demonstrated the first ZSM Proof of Concept (PoC). The slide deck describing the PoC, supplied by EnterpriseWeb, can be found here.
The diagram below shows a conceptual closed-loop assurance architecture used within the PoC.
It contains some concepts similar to a closed-loop traffic engineering project designed by PAOSS back in 2007, but with one big difference: that 2007 project was based on a single-vendor solution, as opposed to the open, multi-vendor PoC demonstrated here. Both were based on the principle of using assurance monitors to trigger fulfillment responses. For example, ours used SLA threshold breaches on voice switches to trigger automated remedial responses through the OSS’s provisioning engine.
For this newer example, ETSI’s blog details, “The PoC story relates to a congestion event caused by a DDoS (Denial of Service) attack that results in a decrease in the voice quality of a network service. The fault is detected by service monitoring within one or more domains and is shared with the end-to-end service orchestrator which correlates the alarms to interpret the events, based on metadata and metrics, and classifies the SLA violations. The end-to-end service orchestrator makes policy-based decisions which trigger commands back to the domain(s) for remediation.”
You’ll notice one of the key call-outs in the diagram above is real-time inventory. That was much harder for us to achieve back in 2007 than it is now with virtualised network and compute layers providing real-time telemetry. We used inventory that was only auto-discovered once daily and had to build in error handling, whilst relying on over-provisioned physical infrastructure.
It’s exciting to see these types of projects being taken forward by ETSI, EnterpriseWeb, et al.
Many years ago, I made a data migration blunder that slowed a production OSS down to a crawl. Actually, less than a crawl. It almost became unusable.
I was tasked with creating a production database of a carrier’s entire network inventory, including data migration for a bunch of Nortel Passport ATM switches (yes, it was that long ago).
There were around 70 of these devices in the network
14 usable slots in each device (ie slots not reserved for processing, resilience, etc)
Depending on the card type there were different port densities, but let’s say there were 4 physical ports per slot
Up to 2,000 VPIs per port
Up to 65,000 VCIs per VPI
The customer was running SPVC
To make it easier for the operator to create a new customer service, I thought I should script-create every VPI/VCI on every port on every device. That would allow the operator to just select any available VPI/VCI from within the OSS when provisioning (or later, auto-provisioning) a service.
There was just one problem with this brainwave. For this particular OSS, each VPI/VCI represented a logical port that became an entry alongside physical ports in the OSS’s ports table… You can see what’s about to happen, can’t you? If only I could’ve….
My script auto-created nearly 510 billion VCI logical ports, plus millions more records for the VPIs and physical ports, all in the ports table… of a production database. And that was just the ATM switches!
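For the curious, the arithmetic behind that blow-out, using the approximate figures above, can be sketched as:

```python
# Approximate figures from the story above.
devices = 70
slots_per_device = 14      # usable slots only
ports_per_slot = 4         # assumed average port density
vpis_per_port = 2_000
vcis_per_vpi = 65_000

physical_ports = devices * slots_per_device * ports_per_slot
vpis = physical_ports * vpis_per_port
vcis = vpis * vcis_per_vpi

print(f"{physical_ports:,} physical ports")   # 3,920 physical ports
print(f"{vpis:,} VPI logical ports")          # 7,840,000 VPI logical ports
print(f"{vcis:,} VCI logical ports")          # 509,600,000,000 VCI logical ports
```

Every one of those VCIs became a row in the ports table, which is why the database ground to a halt.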
So instead of making life easier for the operators, it brought the OSS’s database to a near standstill. Brilliant!!
Luckily for me, it was a greenfields OSS build and the production database was still being built up in readiness for operational users to take the reins. I was able to strip all the ports out and try again with a less idiotic data creation plan.
The reality was that there’s no way the customer could ever have used the 2,000 x 65,000 VPI/VCI groupings I’d created on every single physical port. Put it this way: there were far fewer than 130 million services across all service types, across all carriers, across that whole country!
Instead, we just changed the service activation process to manually add new VPI/VCIs into the database on demand as one of the pre-cursor activities when creating each new customer service.
“One business customer, for example, may require ultra-reliable services, whereas other business customers may need ultra-high-bandwidth communication or extremely low latency. The 5G network needs to be designed to be able to offer a different mix of capabilities to meet all these diverse requirements at the same time.
From a functional point of view, the most logical approach is to build a set of dedicated networks each adapted to serve one type of business customer. These dedicated networks would permit the implementation of tailor-made functionality and network operation specific to the needs of each business customer, rather than a one-size-fits-all approach as witnessed in the current and previous mobile generations which would not be economically viable.
A much more efficient approach is to operate multiple dedicated networks on a common platform: this is effectively what “network slicing” allows. Network slicing is the embodiment of the concept of running multiple logical networks as virtually independent business operations on a common physical infrastructure in an efficient and economical way.”
GSMA’s Introduction to Network Slicing.
Engineering a network is an exercise in compromise. There are many different optimisation levers to pull to engineer a set of network characteristics. In the traditional network, it was a case of pulling all the levers to find a middle-ground set of characteristics that supported all of the operator’s service offerings.
QoS striping of traffic allowed for a level of differentiation of traffic handling, but the underlying network was still a balancing act of settings. Network virtualisation offers new opportunities. It allows unique segmentation via virtual networks, where each can be optimised for the specific use-cases of that network slice.
For years, I’ve been proposing the concept of telco offerings being like electricity networks – that we don’t need so many service variants. I should note that this analogy is not quite right. We do have a few different types of “electricity”, such as highly available (health monitoring), high-bandwidth (content streaming), extremely low latency (rapid-reaction scenarios such as real-time sensor networks), etc.
Now what do we need to implement and manage all these network slices?? Oh that’s right, OSS! It’s our OSS that will help to efficiently coordinate all the slicing and dicing that’s coming our way… to optimise all the levers across all the different network slices!
Many years ago, I was lucky enough to lead a team responsible for designing a complex inside and outside plant network in a massive oil and gas precinct. It had over 120 buildings and more than 30 networked systems.
We were tasked with using CAD (Computer Aided Design) and Office tools to design the comms and security solution for the precinct. And when I say security, not just network security, but building access control, number plate recognition, coast guard and even advanced RADAR amongst other things.
One of the cool aspects of the project was that it was more three-dimensional than a typical telco design. A telco cable network is usually planned on x and y coordinates because the z coordinate usually sits on only one or two planes (eg all ducts are at say 0.6m below ground level, or all catenary wires between poles are at say 5m above ground). However, on this site, cable trays ran at all sorts of levels to route around critical gas processing infrastructure.
We actually proposed to implement a light-weight OSS for management of the network, including outside plant assets, due to its easy maintainability compared with CAD files. The customer’s existing CAD files may have been perfect when initially built / handed over, but were nearly useless to us because of all the undocumented changes that had happened in the ensuing period. However, the customer was used to CAD files and wanted to stay with CAD files.
This led to another cool aspect of the project – we had to build out de facto OSS data models to capture and maintain the designs, covering layers such as the support plane (trayway, ducts, sub-ducts, trenches, lead-ins, etc).
That got me wondering what other proxy metrics might be used to provide predictive indicators of what’s happening in your network, OSS and/or BSS. Apparently, “Colt aims to enhance its service assurance capabilities by taking non-traditional data (signal strength, power, temperature, etc.) from network elements (cards, links, etc.) to predict potential faults,” according to James Crawshaw on LightReading.
What about environmental metrics like humidity, temperature, movement, power stability/disturbance?
I’d love to hear about what proxies you use or what unexpected metrics you’ve found to have shone the spotlight on friction in your organisation.
NPS, or Net Promoter Score, has become commonly used in the telecoms industry in recent years. In effect, it is a metric that measures friction in the business. If NPS is high, the business runs more smoothly: customers are happy with the service, want to buy more of it, and rarely need to contact the business. If NPS is low, it’s harder to make sales and there’s the additional cost of time spent dealing with customer complaints, etc (until the customer goes away, of course).
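As a refresher, NPS is derived from a 0-10 “how likely are you to recommend us?” survey question: scores of 9-10 are promoters, 7-8 are passives, 0-6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors. A minimal sketch (with made-up responses):

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical responses from 10 customers:
# 5 promoters (50%), 3 passives, 2 detractors (20%).
print(nps([10, 9, 9, 8, 7, 7, 6, 5, 9, 10]))  # 30.0
```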
NPS can be easy to measure via survey, but a little more challenging as a near-real-time metric. What if we used customer contacts (via all channels such as phone, IVR, email, website, live-chat, etc) as a measure of friction? But more importantly, how does any of this relate to OSS / BSS? We’ll get to that shortly (I hope).
BSS (billing, customer relationship management, etc) and OSS (service health, network performance, etc) tend to be the final touchpoints of a workflow before reaching a customer. When the millions of workflows through a carrier complete without customer contact, friction is low. When there are problems, calls go up and friction / inefficiency goes up with them. When there are problems, the people (or systems) dealing with the calls (eg contact centre operators) tend to start with the OSS / BSS tools and then work their way back up the funnel to identify the cause of friction and attempt to resolve it.
The problem is that the OSS / BSS tools are often seen as the culprit because that’s where the issue first becomes apparent. It’s easier to log an issue against the OSS than to keep tracking back to the real source of the problem. Many times, it’s a case of shooting the messenger. Not only that, but if we’re not actually identifying the source of the problem then it becomes systemic (ie the poor customer experience perpetuates).
Maybe there’s a case for us to get better at tracking the friction caused further upstream of our OSS / BSS and to give more granular investigative tools to the call takers. Even if we do, our OSS / BSS are still the ones delivering the message.
OSS tend to be very good at presenting a current moment in time – the current configuration of the network, the health of the network, the activities underway.
Some (but not all) tend to struggle to cope with other moments in time – past and future.
Most have tools that project into the future for the purpose of capacity planning, such as link saturation estimation (projecting forward from historical trend-lines). Predictive analytics is a current buzzword as researchers attempt to predict future events and mitigate for them now.
Most also have the ability to look into the past – to look at historical logs to give an indication of what happened previously. However, historical logs can be painful and tend towards forensic analysis. We can generally see who (or what) performed an action at a precise timestamp, but it’s not so easy to correlate the surrounding context in which that action occurred. They rarely present a fully-stitched view in the OSS GUI that shows the state of everything else around it at that snapshot in time past. At least, not to the same extent that the OSS GUI can stitch and present current state together.
But the scenario that I find most interesting is for the purpose of network build / maintenance planning. Sometimes these changes occur as isolated events, but are more commonly run as projects, often with phases or milestone states. For network designers, it’s important to differentiate between assets (eg cables, trenches, joints, equipment, ports, etc) that are already in production versus assets that are proposed for installation in the future.
And naturally those states cross over at cut-in points. The proposed new branch of the network needs to connect to the existing network at some time in the future. Designers need to see available capacity now (eg spare ports), but be able to predict with confidence that capacity will still be available for them in the future. That’s where the “reserved” status comes into play, which tends to work for physical assets (eg physical ports) but can be more challenging for logical concepts like link utilisation.
In large organisations, it can be even more challenging because there’s not just one augmentation project underway, but many. In some cases, there can be dependencies where one project relies on capacity that is being stood up by other future projects.
Not all of these projects / plans will make it into production (eg funding is cut or a more optimal design option is chosen), so there is also the challenge of deprecating planned projects. Capability is required to find whether any other future projects are dependent on this deprecated future project.
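One way of modelling this, sketched below with hypothetical state names, is an explicit lifecycle state machine per asset, where “reserved” sits between planned and in-service, and deprecating a planned project simply cancels its reservations:

```python
# Hypothetical asset lifecycle; real OSS inventory models vary widely.
ALLOWED = {
    "planned": {"reserved", "cancelled"},
    "reserved": {"in_service", "cancelled"},
    "in_service": {"decommissioned"},
}

def transition(state, new_state):
    """Move an asset to new_state, rejecting illegal jumps (eg planned -> in_service)."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

state = transition("planned", "reserved")   # capacity held for a future project
state = transition(state, "in_service")     # cut-in completed
```

The same machinery gives a natural hook for the dependency check: cancelling a planned asset can first query which other future designs hold reservations against it.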
It can get incredibly challenging to develop this time/space matrix in OSS. If you’re a developer of OSS, the question becomes whether you want to take the blue or red pill.
Today we look at the front-loading benefits of building OSS / network auto-discovery tools.
We all know that OSS are only as good as the data we seed them with. As the old saying goes, garbage in, garbage out.
Assurance / network-health data is generally collected directly from the network in near real time, typically using SNMP traps, syslog events and similar. The network is generally the data master for this assurance-style data, so it makes sense to pull data from the network wherever possible (ie bottom-up data flows).
Fulfilment data, in the form of customer orders, network designs, etc, is often captured in external systems first (ie as master) and pushed into the network as configurations (ie top-down data flows).
These two flows meet in the middle, as they both tend to rely on network inventory and/or resources. Bottom-up, network faults need to be tied to inventory / resource / topology information (eg fibre cuts, port failures, device failures, etc), which is important for fault identification (Root Cause Analysis [RCA]). Similarly, top-down, customer services / circuits / contracts tend to consume inventory, capacity and/or other resources.
Looking end-to-end and correlating network health (assurance) to customer service health (fulfilment) (eg as Service Level Agreement [SLA] analysis, Service Impact Analysis [SIA]) tends to only be possible due to reconciliation via inventory / resource data sets as linking keys.
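As a toy illustration of inventory acting as the linking key (all identifiers below are hypothetical), a shared port identifier is what lets a network alarm be joined to the customer services that consume that port:

```python
# Hypothetical data sets keyed on a shared port identifier.
alarms = [{"port": "dev1/1/2", "type": "LOS"}]
services = [
    {"service_id": "SVC-001", "port": "dev1/1/2"},
    {"service_id": "SVC-002", "port": "dev1/1/3"},
]

def impacted_services(alarms, services):
    """Join assurance events to fulfilment records via the inventory key."""
    alarmed_ports = {a["port"] for a in alarms}
    return [s["service_id"] for s in services if s["port"] in alarmed_ports]

print(impacted_services(alarms, services))  # ['SVC-001']
```

If the port naming differs between the assurance and fulfilment data sets, this join silently fails, which is exactly why the field mapping work described below matters so much.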
Seeding an OSS with inventory / resource data can be done via three methods (or a combination of them):
1. Data migration (eg script-loading from external sources such as spreadsheets or CSV files)
2. Manual data creation
3. Auto-discovery (ie collection of data directly from the network)
Options 1 and 2 are probably the more traditional methods for the initial seeding of OSS databases, mainly because they tend to be faster at demonstrating progress.
Option 3 is the front-loading option that can be challenging in the short-to-medium term but will hopefully prove beneficial in the longer term (just like knowledge transfer and automated testing).
It might seem easy to just suck data directly out of the network, but the devil is definitely in the detail, details such as:
Choosing optimal timings to poll the network without saturating it (if notifications aren’t being pushed by the network), not to mention session handling of these long-running data transfers
Building the mediation layer to perform protocol conversion and data mappings
Field translation / mapping to common naming standards. This can be much more difficult than it sounds and is key to allow the assurance and fulfilment meet-in-the-middle correlations to occur (as described above)
Reconciliation between the data being presented by the network and what’s already in the OSS database. In theory, the data presented by the network should always be right, but there are scenarios where it isn’t (eg flapping ports giving the appearance of assets being present / absent in network audit data, assets in test / maintenance mode that aren’t intended to be accepted into the OSS inventory pool, lifecycle transitions from planned to built, etc)
Discrepancy and exception handling rules
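A first pass at that reconciliation (hypothetical identifiers again) often reduces to set comparison, with the differences feeding the discrepancy / exception handling rules:

```python
discovered = {"dev1/1/1", "dev1/1/2", "dev2/3/1"}   # from a network audit
inventory = {"dev1/1/1", "dev1/1/3"}                # from the OSS database

in_net_not_oss = discovered - inventory   # candidates to create (or flag for review)
in_oss_not_net = inventory - discovered   # possibly planned, in maintenance, or flapping
matched = discovered & inventory          # no action needed

print(sorted(in_net_not_oss))  # ['dev1/1/2', 'dev2/3/1']
print(sorted(in_oss_not_net))  # ['dev1/1/3']
```

The hard part is not the set arithmetic but deciding, per exception rule, which of the two sources wins in each scenario.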
All of the above make it challenging even for “siloed” data sets, but perhaps more challenging still is the discovery and auto-stitching of cross-domain data sets (eg cross-domain circuits / services / resource chains)
The vexing question arises – do you front-load and seed via auto-discovery or perform a creation / migration that requires more manual intervention? In some cases, it could even be a combination of both as some domains are easier to auto-discover than others.
You all know I’m a fan of training operators in OSS sandpits (and as apprenticeships during the build phase) rather than a week or two of classroom training at the end of a project.
To reduce the re-work in building a sandpit environment, which will probably be a dev/test environment rather than a production environment, I like to go all the way back to the vendor selection process.
Running a Proof of Concept (PoC) is a key element of vendor selection, in my opinion. The PoC should only include a small short-list of pre-selected solutions so as not to waste the time of the operator or the vendors / integrators. But once short-listed, the PoC should be a cut-down reflection of the customer’s context. Where feasible, it should connect to some real devices / apps (maybe lab devices / apps, possibly via a common / simple interface like SNMP). This takes some time on both sides to set up, but it shows how easily (or not) the solution can integrate with the customer’s active network, BSS, etc. It should be specifically set up to show the device types, alarm types, naming conventions, workflows, etc that fit into the customer’s specific context. That allows the customer to understand the new OSS in terms they’re familiar with.
And since the effort has been made to set up the PoC, doesn’t it make sense to make further use of it rather than just throw it away? If the winning bidder then leaves the PoC environment in the hands of the customer, it becomes the sandpit to play in. The big benefit for the winning bidder is that the customer will hopefully have fewer “what if?” questions to distract the project team during the implementation phase. Answers can be demonstrated, even if only partially, in the sandpit environment rather than given as empty words.
Have you noticed that OSS projects need to go through extensive review to get their business cases funded? That makes sense. They tend to be a big investment, after all. Many OSS projects fail, so we want to make sure this one doesn’t, and we perform thorough planning / due diligence.
But I do find it interesting that we spend less time and effort on Post Implementation Reviews (PIRs). We might do the review of the project, but do we compare with the Cost Benefit Analysis (CBA) that undoubtedly goes into each business case?
Even more interesting is that we spend even less time and effort performing ongoing analysis of an implemented OSS against the CBA.
Why interesting? Well, if we took the time to figure out what has really worked, we might have better (and more persuasive) data to improve our future business cases. Not only that, but we’d have more chance of reducing the effort on the business-case side of the scale compared with current approaches (as per the diagrams above).
Something curious dawned on me the other day – I wondered how many people / organisations actively manage alarms / alerts being generated by their lab equipment?
At first glance, this would seem silly. Lab environments are in constant flux, in all sorts of semi-configured situations, and therefore likely to be alarming their heads off at different times.
As such, it would seem even sillier to send alarms, syslogs, performance counters, etc from your lab boxes through to your production alarm management platform. Events on your lab equipment simply shouldn’t be distracting your NOC teams from handling events on your active network right?
But here’s why I’m curious, in a word: DevOps. Does the DevOps model now mean that some lab equipment stays in a relatively stable state and performs near-mission-critical activities like automated regression testing? If so, does it make sense to selectively monitor / manage some lab equipment?
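Selective monitoring could be as simple as a filter in the event pipeline that forwards only events from lab devices nominated as stable (the device names and tagging scheme below are purely illustrative):

```python
# Lab devices nominated as stable, eg those running automated regression suites.
stable_lab_devices = {"lab-ci-01", "lab-ci-02"}

def forward_to_noc(event):
    """Forward all production events, plus events from nominated lab devices."""
    if event["realm"] == "production":
        return True
    return event["device"] in stable_lab_devices

events = [
    {"device": "core-rtr-1", "realm": "production"},
    {"device": "lab-ci-01", "realm": "lab"},
    {"device": "lab-scratch-7", "realm": "lab"},  # filtered: still in constant flux
]
print([e["device"] for e in events if forward_to_noc(e)])  # ['core-rtr-1', 'lab-ci-01']
```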
In what other scenarios might we wish to send lab alarms to the NOC (and I’m not talking about system integration testing type scenarios, but ongoing operational scenarios)?
There’s a lot of excitement about what machine-led decisioning can introduce into the world of network operations, and rightly so. Excitement about predictions, automation, efficiency, optimisation, zero-touch assurance, etc.
There are so many use-cases that disruptors are proposing to solve using Artificial Intelligence (AI), Machine Learning (ML) and the like. I might have even been guilty of proposing a few ideas here on the PAOSS blog – ideas like closed-loop OSS learning / optimisation of common processes, the use of Robotic Process Automation (RPA) to reduce swivel-chairing and a few more.
But have you ever noticed that most of these use-cases are forward-looking? That is, using past data to look into the future (eg predictions, improvements, etc). What if we took a slightly different approach and used past data to reconcile what we’ve just done?
AI and ML models tend to rely on lots of consistent data. OSS produce or collect lots of consistent data, so we get a big tick there. The only problem is that a lot of our manually created OSS data is riddled with inconsistency, due to nuances in the way that each worker performs workflows, data entry, etc as well as our own personal biases and perceptions.
For example, zero-touch automation has significant cachet at the moment. To achieve that, we need network health indicators (eg alarms, logs, performance metrics), but also a record of interventions that have (hopefully) improved on those indicators. If we have inconsistent or erroneous trouble-ticketing information, our AI is going to struggle to figure out the right response to automate. Network operations teams tend to want to fix problems quickly, get the ticketing data populated as fast as possible and move on to the next fault, even if that means creating inconsistent ticketing data.
So, to get back to looking backwards, a machine-learning use-case to consider today is to look at what an operator has solved, compare it to past resolutions and either validate the accuracy of their resolution data or even auto-populate fields based on the operator’s actions.
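A crude sketch of that backward-looking check, using nothing more than stdlib string similarity to flag a ticket resolution that diverges from how similar faults were resolved previously (the data and threshold are hypothetical; a real system would use proper ML features rather than raw text matching):

```python
from difflib import SequenceMatcher

# Hypothetical history of fault descriptions and their resolutions.
past_resolutions = {
    "port flapping": "replaced SFP on affected port",
    "bgp session down": "reset BGP session after config rollback",
}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def validate(fault, resolution, threshold=0.6):
    """True if the resolution resembles how the most similar past fault was resolved."""
    _, past = max((similarity(fault, f), r) for f, r in past_resolutions.items())
    return similarity(resolution, past) >= threshold

print(validate("port flapping", "replaced SFP on the port"))  # True
```

A mismatch would prompt the operator to confirm (or correct) their ticket data while the fault is still fresh, which is exactly where consistency is usually lost.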
Hat tip to Jay Fenton for the idea seed for this post.
“Many systems are moving beyond simple virtualization and are being run on dynamic private or even public clouds. CSPs will migrate many to hybrid clouds because of concerns about data security and regulations on where data are stored and processed.
We believe that over the next 15 years, nearly all software systems will migrate to clouds provided by third parties and be whatever cloud native becomes when it matures. They will incorporate many open-source tools and middleware packages, and may include some major open-source platforms or sub-systems (for example, the size of OpenStack or ONAP today).”
Dr Mark H Mortensen in an article entitled, “BSS and OSS are moving to the cloud: Analysys Mason” on Telecom Asia.
Dr Mortensen raises a number of other points relating to cloud models for OSS and BSS in the article linked above, including definitions of various cloud / virtualisation-related terms.
He also rightly points out that many OSS / BSS vendors are seeking to move to virtualised / cloud / as-a-Service delivery models (for reasons including maintainability, scalability, repeatability and other “ilities”).
The part that I find interesting with cloud models (I’ll use the term generically) is positioning of the security control point(s). Let’s start by assuming a scenario where:
The Active Network (AN) is “on-net” – the network that carries live customer traffic (the routers, switches, muxes, etc) is managed by the CSP / operator [Noting though, that these too are possibly managed as virtual entities rather than owned].
The “cloud” OSS/BSS is “off-net” – some vendors will insist on their multi-tenanted OSS/BSS existing within the public cloud
The diagram below shows three separate realms:
The OSS/BSS “in the cloud”
The operator’s enterprise / DC realm
The operator’s active network realm
as well as the Security Control Points (SCPs) between them.
The most important consideration of this architecture is that the Active Network remains operational (ie continues to carry customer traffic) even if the link to the DC and/or the link to the cloud is lost.
With that in mind, our second consideration is which aspects of network management need to reside within the AN realm. It’s not just the Active Network devices, but anything else that allows the AN to operate in an isolated state. This means that shared services like NTP / synch need a presence in the AN realm (even if not of the highest stratum within the operator’s time-synch solution).
What about Element Managers (EMS) that look after the AN devices? How about collectors / probes? How about telemetry data stores? How about network health management tools like alarm and performance management? How about user access management (LDAP, AD, IAM, etc)? Do they exist in the AN or DC realm?
Then if we step up the stack a little to what I refer to as East-West OSS / BSS tools like ticket management, workforce management, even inventory management – do we collect, process, store and manage these within the DC or are we prepared to shift any of this functionality / data out to the cloud? Or do we prefer it to remain in the AN realm and ensure only AN privileged users have access?
Which OSS / BSS tools remain on-net (perhaps as private cloud) and which can (or must) be managed off-net (public cloud)?
Climb further up the stack and we get into the interesting part of cloud offerings. Not only do we potentially have the OSS/BSS (including East-West tools), but more excitingly, we could bring in services or solutions like content from external providers and bundle them with our own offerings.
We often hear about the tight security that’s expected (and offered) as part of the vendor OSS/BSS cloud solutions, but as you see, the tougher consideration for network management architects is actually the end-to-end security and where to position the security control points relative to all the pieces of the OSS/BSS stack.
“In the past, business-oriented groups have had ideas about what they want to do and then they come to us… Now, they want to know what technology can bring to the table and then they’ll work on the business plan.
So there’s a big gap here. It’s a phenomenon that’s been happening in the last year and it’s an uncomfortable place for IT. We’re not used to having to lead in that way. We have been more in the order-taker business.”
Veenod Kurup, Liberty Global, Group CIO.
That’s a really thought-provoking insight from Veenod isn’t it? Technology driving the business rather than business driving the technology. Technology as the business advantage.
I’ll be honest here: I never thought I’d see the day, although I guess that as e-business increases, the dependency on tech increases in lockstep. I’m passionate about tech, but also of the opinion that the tech is only a means to an end.
So if what Veenod says is reflective of a macro-trend across all industry then he’s right in saying that we’re going to have some very uncomfortable situations for some tech experts. Many will have to widen their field of view from a tech-only vision to a business vision.
Maybe instead the business-oriented groups could just come to the OSS / BSS department and speak with our valuable tripods. After all, we own and run the information and systems where business (BSS) meets technology (OSS) right? Or as previously reiterated on this blog, we have the opportunity to take the initiative and demonstrate the business value within our OSS / BSS.
PS. hat-tip to Dawn Bushaus for unearthing the quote from Veenod in her article on Inform.
“In 1972, the Club of Rome in its report The Limits to Growth predicted a steadily increasing demand for material as both economies and populations grew. The report predicted that continually increasing resource demand would eventually lead to an abrupt economic collapse. Studies on material use and economic growth show instead that society is gaining the same economic growth with much less physical material required. Between 1977 and 2001, the amount of material required to meet all needs of Americans fell from 1.18 trillion pounds to 1.08 trillion pounds, even though the country’s population increased by 55 million people. Al Gore similarly noted in 1999 that since 1949, while the economy tripled, the weight of goods produced did not change.” Wikipedia on the topic of Dematerialisation.
The weight of OSS transaction volumes appears to be increasing year on year as we add more stuff to our OSS. The touchpoint explosion is amplifying this further. Luckily, our platforms / middleware, compute, networks and storage have all been scaling as well so the increased weight has not been as noticeable as it might have been (even though we’ve all worked on OSS that have been buckling under the weight of transaction volumes right?).
Does it also make sense that when there is an incremental cost per transaction (eg via the increasingly prevalent cloud or “as a service” offerings), we pay closer attention to transaction volumes because there is a greater perception of cost to us, but not for “internal” transactions where there is little perceived incremental cost?
But it’s not so much the transaction processing volumes that are the problem directly. It’s more by implication. For each additional transaction there’s the risk of a hand-off being missed or mis-mapped or slowing down overall activity processing times. For each additional transaction type, there’s additional mapping, testing and regression testing effort as well as an increased risk of things going wrong.
Do you measure transaction flow volumes across your entire OSS suite? Does it provide an indication of where efficiency optimisation (ie dematerialisation) could occur and guide your re-factoring investments? Does it guide you on process optimisation efforts?
I recently had something of a perspective-flip moment in relation to automation within the realm of OSS.
In the past, I’ve tended to tackle the automation challenge from the perspective of applying automated / scripted responses to tasks that are done manually via the OSS. But it’s dawned on me that I’ve had it the wrong way around! That is only an incremental perspective on the main objective of automation: global zero-touch networks.
If we take all of the tasks performed by all of the OSS around the globe, the number of variants is incalculable… which probably means the zero-touch problem is unsolvable (we might be able to solve for many situations, but not all).
The more solvable approach would be to develop a more homogeneous approach to network self-care / self-optimisation. In other words, the majority of the zero-touch challenge is actually handled at the equivalent of EMS level and below (I’m perhaps using out-dated self-healing terminology, but hopefully terminology that’s familiar to readers) and only cross-domain issues bubble up to OSS level.
As the diagram below describes, each layer up abstracts but connects (as described in more detail in “What an OSS shouldn’t do”). That is, each higher layer in the stack reduces the amount of information / control within any single domain that it’s responsible for, but assumes a broader responsibility for connecting domains together.
The abstraction process reduces the number of self-healing variants the OSS needs to handle. But to cope with the complexity of self-caring for connected domains, we need a more homogeneous set of health information being presented up from the network.
Whereas the intent model is designed to push actions down into the lower layers with a standardised, simplified language, this would be the reverse – pushing network health knowledge up to higher layers to deal with… in a standard, consistent approach.
And BTW, unlike the pervading approach of today, I’m clearly saying that when unforeseen (or not previously experienced) scenarios appear within a domain, they’re not just kicked up to OSS, but the domains are stoic enough to deal with the situation inside the domain.
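As an illustrative sketch only (the schema and field names are my own assumptions, not any standard), a homogeneous health report that domains push upward might look something like this – the reverse of an intent model’s downward commands:

```python
from dataclasses import dataclass
from enum import Enum

class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"

@dataclass
class DomainHealthReport:
    # Hypothetical normalised schema each domain controller pushes upward
    domain: str
    status: Health
    self_healed: bool          # True if the domain resolved the issue itself
    cross_domain_impact: bool  # only True reports need OSS attention

def needs_escalation(report: DomainHealthReport) -> bool:
    # Intra-domain issues stay stoic inside the domain;
    # only unresolved cross-domain issues bubble up to the OSS
    return report.cross_domain_impact and report.status is not Health.OK

# e.g. an optical domain reporting degradation it can't fix alone
report = DomainHealthReport("optical", Health.DEGRADED,
                            self_healed=False, cross_domain_impact=True)
print(needs_escalation(report))
```

The key design choice is that every domain, regardless of vendor, reports in the same small vocabulary, so the OSS-level logic stays simple.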
For network operators, our OSS and BSS touch most parts of the business. The network, and the services it carries, are core business, so a majority of business units contribute to that core business. As such, our OSS and BSS provide many of the metrics used by those business units.
This is a privileged position to be in. We get to see what indicators are most important to the business, as well as the levers used to control those indicators. From this privileged position, we also get to see the aggregated impact of all these KPIs.
In your years of working on OSS / BSS, how many times have you seen key business indicators conflict between business units? They generally become most apparent on cross-team projects, where the objectives of one internal team directly conflict with those of another.
In theory, a KPI tree can be used to improve consistency and ensure all business units are pulling towards a common objective… [but what if, like most organisations, there are many objectives? Does that mean you have a KPI forest and the trees end up fighting for light?]
But here’s a thought… Have you ever seen an OSS/BSS suite with the ability to easily build KPI trees? I haven’t. I’ve seen thousands of standalone reports containing myriad indicators, but never a consolidated roll-up of metrics. I have seen a few products that show operational metrics rolled up into a single dashboard, but not business metrics. They appear to have been designed to show an information hierarchy, but not with KPI trees specifically in mind.
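To make the idea concrete, here’s a minimal sketch of what base KPI-tree functionality might look like, with weighted roll-up from leaf metrics to parent objectives. The node names, weights and values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KpiNode:
    # Hypothetical KPI tree node: leaf values roll up to parent objectives
    name: str
    value: float = 0.0
    weight: float = 1.0
    children: List["KpiNode"] = field(default_factory=list)

    def rollup(self) -> float:
        # A leaf reports its own (normalised) value; a branch reports the
        # weighted average of its children, so a conflict between business
        # units surfaces as a degraded parent objective
        if not self.children:
            return self.value
        total_weight = sum(c.weight for c in self.children)
        return sum(c.rollup() * c.weight for c in self.children) / total_weight

# e.g. a simplified "customer experience" objective
tree = KpiNode("customer_experience", children=[
    KpiNode("mttr_score", value=0.9, weight=2),
    KpiNode("activation_time_score", value=0.6),
])
print(round(tree.rollup(), 2))  # -> 0.8
```

A reporting module built on a structure like this would let each business unit see how its lever contributes to (or fights against) the common objective.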
What do you think? Does it make sense for us to offer KPI trees as base product functionality from our reporting modules? Would this functionality help our OSS/BSS add more value back into the businesses we support?
One of the challenges facing OSS / BSS product designers is which platform/s to tie the roadmap to.
Let’s use a couple of examples.
In the past, most outside plant (OSP) designs were done in AutoCAD, so it made sense to build OSP design tools around AutoCAD. However, in making that choice, the OSP developer becomes locked into whatever product directions AutoCAD chooses in future. And if AutoCAD falls out of favour as an OSP design format, the earlier decision to align with it becomes an even bigger challenge to work around.
A newer example is for supply chain related tools to leverage Salesforce. The tools and marketplace benefits make great sense as a platform to leverage today. But it also represents a roadmap lock-in that developers hope will remain beneficial in the years / decades to come.
I haven’t seen an OSS / BSS product built around Facebook, but there are plenty of other tools that have been. Given Facebook’s recent travails, I wonder how many product developers are feeling at risk due to their reliance on the Facebook platform?
The same concept extends to the other platforms we choose – operating systems, programming languages, messaging models, data storage models, continuous development tools, etc, etc.
Anecdotally, it seems that new platforms are coming online at an ever greater rate, so the chance / risk of platform churn is surely much higher than when AutoCAD was chosen back in the 1980s – 1990s.
The question becomes how to leverage today’s platform benefits, whilst minimising dependency and allowing easy swap-out. There’s too much nuance to answer this definitively, but the one key strategy is to try to build key logic models outside the dependent platform (ie ensuring logic isn’t built into AutoCAD or similar).
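A minimal sketch of that strategy, using a hypothetical OSP design tool as the example: the domain logic (cables, lengths, identifiers) lives in plain objects, and the platform dependency is quarantined behind a thin adapter interface, so swapping platforms means writing a new adapter rather than new logic.

```python
from abc import ABC, abstractmethod

# Core OSP design logic lives in plain domain objects, independent of
# any drawing platform (AutoCAD, GIS, etc.)
class Cable:
    def __init__(self, cable_id: str, length_m: float):
        self.cable_id = cable_id
        self.length_m = length_m

# The platform dependency is confined to a thin adapter interface...
class DesignCanvas(ABC):
    @abstractmethod
    def draw_cable(self, cable: Cable) -> str: ...

# ...so a platform swap-out only touches the adapter layer
class TextCanvas(DesignCanvas):  # stand-in for an AutoCAD / GIS adapter
    def draw_cable(self, cable: Cable) -> str:
        return f"cable {cable.cable_id}: {cable.length_m}m"

canvas: DesignCanvas = TextCanvas()
print(canvas.draw_cable(Cable("FC-001", 250.0)))
```

The same hexagonal / ports-and-adapters shape applies to the other platform choices mentioned above: operating systems, messaging models, storage and so on.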
“There are three broad models of networking in use today. The first is the adaptive model where devices exchange peer information to discover routes and destinations. This is how IP networks, including the Internet, work. The second is the static model where destinations and pathways (routes) are explicitly defined in a tabular way, and the final is the central model where destinations and routes are centrally controlled but dynamically set based on policies and conditions.” Tom Nolle here.
OSS of decades past worked best with static networks. Services / circuits that were predominantly “nailed up” and (relatively) rarely changed after activation. This took the real-time aspect out of play and justified the significant manual effort required to establish a new service / circuit.
However, adaptive and centrally managed networks have come to dominate the landscape now. In fact, I’m currently working on an assignment where DWDM, a technology that was once largely static, is now being augmented to introduce an SDN controller and dynamic routing (at optical level no less!).
This paradigm shift changes the fundamentals of how OSS operate. Apart from the physical layer, network connectivity is now far more transient, so our tools must be able to cope with that. Not only that, but the changes are too frequent to justify the manual effort of the past.
There’s something I’ve noticed about OSS products – they are either designed to be abstract / flexible or they are designed to cater for specific technologies / topologies.
When designed from the abstract perspective, the tools are built around generic core data models. For example, whether a virtual / logical device, a physical device, a customer premises device, a core network device, an access network device, a transmission device, a security device, etc, they would all be lumped into a single object called a device (but with the flexibility to cater for the many differences).
When designed from the specific perspective, the tools are built with exact network and/or service models in mind. That could be 5G, DWDM, physical plant, etc and when any new technology / topology comes along, a new plug-in has to be built to augment the tools.
The big advantage of the abstract approach is obvious (if they truly do have a flexible core design) – you don’t have to do any product upgrades to support a new technology / topology. You just configure the existing system to mimic the new technology / topology.
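As a toy sketch of the abstract approach (the field names are illustrative, not drawn from any real product), a single generic device object with free-form attributes can absorb new technologies through configuration alone:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Device:
    # One abstract object covers physical, virtual, access, core, CPE, etc.
    device_id: str
    role: str                  # e.g. "core-router", "cpe", "security"
    attributes: Dict[str, Any] = field(default_factory=dict)
    # free-form attributes mean a new technology / topology is a matter of
    # configuration, not a product upgrade or plug-in

# New network elements are modelled by convention, not by new code
olt = Device("OLT-01", role="access-olt",
             attributes={"pon_ports": 16, "technology": "XGS-PON"})
firewall = Device("FW-07", role="security",
                  attributes={"throughput_gbps": 40})
```

The trade-off, of course, is that the lateral thinking mentioned below is needed to map each new technology onto the generic model.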
The first OSS product I worked with had a brilliant core design and was able to cope with any technology / topology that we were faced with. It often just required some lateral thinking to make the new stuff fit into the abstract data objects.
What I also noticed was that operators always wanted to customise the solution so that it became more specific. They were effectively trying to steer the product from an open / abstract / flexible tool-set to the other end of the scale. They generally paid significant amounts to achieve specificity – to mould the product to exactly what their needs were at that precise moment in time.
However, even as an OSS newbie, I found that to be really short-term thinking. A network (and the services it carries) is constantly evolving. Equipment goes end-of-life / end-of-support every few years. Useful life models are usually approx 5-7 years and capital refresh projects tend to be ongoing. Then, of course, vendors are constantly coming up with new products, features, topologies, practices, etc.
Given this constant change, I’d much rather have a set of tools that are abstract and flexible rather than specific but less malleable. More importantly, I’ll always try to ensure that any customisations retain the integrity of the abstract, flexible core rather than steering the product towards specificity.