The Ineffective OSS Scoreboard Analogy

Imagine for a moment that you’re the coach of a sporting team. You train your team and provide them with a strategy for the game. You send them out onto the court and let them play.

The scoreboard gives you all of the stats about each player. Their points, blocks, tackles, heart-rate, distance covered, errors, etc. But it doesn’t show the total score for each team or the time remaining in the game. 

That’s exactly what most OSS reports and dashboards are like! You receive all of the transactional data (eg alarms, truck-rolls, device performance metrics, etc), but not how you’re collectively tracking towards team objectives (eg growth targets, risk reduction, etc). 

Yes, you could infer whether the team is doing well by reverse engineering the transactional data. Yes, you could then apply strategies against those inferences in the hope that it has a positive impact. But that’s a whole lot of messing around in the chaos of the coach’s box with the scores close (you assume) and the game nearing the end (possibly). You don’t really know when the optimal time is to switch your best players back into the game.

As coach with funding available, would you be asking your support team to give you more transactional tools / data or the objective-based insights?

Does this analogy help articulate the message from the previous two posts (Wed and Thurs)?

PS. What if you wanted to build a coach-bot to replace yourself in the near future? Are you going to build automations that close the feedback loop against transactional data or are you going to be providing feedback that pulls many levers to optimise team objectives?

One big requirement category most OSS can’t meet

We talked yesterday about a range of OSS products that are more outcome-driven than our typically transactional OSS tools. There aren’t many of them around at this stage. I refer to them as “data bridge” products.
 
Our typical OSS tools help manage transactions (alarms, activating customer services, etc). They’re generally not so great at (directly) managing objectives such as:
  • Sign up an extra 50,000 customers along the new Southern network corridor this month
  • Optimise allocation of our $10M capital budget to improve average attainable speeds by 20% this financial year
  • Achieve 5% revenue growth in Q3
  • Reduce truck rolls by 10% in the next 6 months
  • Optimal management of the many factors that contribute to churn, thus reducing churn risk by 7% by next March
 
We provide tools to activate the extra 50,000 customers. We also provide reports / dashboards that visualise the numbers of activations. But we don’t tend to include the tools to manage ongoing modelling and option analysis to meet key objectives. Objectives that are generally quantitative and tied to time, cost, etc and possibly locations/regions. 
 
These objectives are often really difficult to model and have multiple inputs. Managing to them requires data that’s changing on a daily basis (or potentially even more often – think of how a single missed truck-roll ripples out through re-calculation of optimal workforce allocation).
 
That requires:
  • Access to data feeds from multiple sources (eg existing OSS, BSS and other sources like data lakes)
  • Near real-time data sets (or at least streaming or regularly updating data feeds)
  • An ability to quickly prepare and compare options (data modelling, possibly using machine-based learning algorithms)
  • Advanced visualisations (by geography, time, budget drawdown and any graph types you can think of)
  • Flexibility in what can be visualised and how it’s presented
  • Methods for delivering closed-loop feedback to optimise towards the objectives (eg RPA)
  • Potentially manage many different transaction-based levers (eg parallel project activities, field workforce allocations, etc) that contribute to rolled-up objectives / targets
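
As a rough illustration of the gap between reporting transactions and managing to an objective, the sketch below rolls daily truck-roll counts up against the "reduce truck rolls by 10%" target listed earlier. The class, the function, the baseline of 200 truck rolls per day and the sample counts are all hypothetical. A real data bridge product would need to do far more than this simple average.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Objective:
    """A quantitative target (all values used here are illustrative)."""
    name: str
    baseline_per_day: float   # assumed average daily truck rolls before the initiative
    reduction_target: float   # eg 0.10 for a 10% reduction


def objective_status(objective: Objective, daily_counts: List[int]) -> Dict[str, object]:
    """Roll transactional counts up into a single on-track / off-track view."""
    if not daily_counts:
        raise ValueError("need at least one day of data")
    actual_per_day = sum(daily_counts) / len(daily_counts)
    target_per_day = objective.baseline_per_day * (1 - objective.reduction_target)
    reduction_achieved = 1 - actual_per_day / objective.baseline_per_day
    return {
        "objective": objective.name,
        "actual_per_day": actual_per_day,
        "target_per_day": target_per_day,
        "reduction_achieved": reduction_achieved,
        "on_track": actual_per_day <= target_per_day,
    }


# Hypothetical objective and made-up daily truck-roll counts.
truck_rolls = Objective("Reduce truck rolls by 10%", baseline_per_day=200, reduction_target=0.10)
status = objective_status(truck_rolls, [190, 185, 178, 182, 175])
print(status)
```

The point of the sketch is the shape of the output: one answer about the objective, rather than another list of transactions for someone to interpret.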
 
You can see why I refer to this as a data bridge product right? I figure that it sits above all other data sources and provides the management bridge across them all. 
 
PS. If you want to know the name of the existing products that fit into the “data bridge” category, please leave us a message.

Do you want funding on an OSS project?

OSS tend to be very technical and transactional in nature. For example, a critical alarm happens, so we have to coordinate remedial actions as soon as possible. Or, a new customer has requested service so we have to coordinate the workforce to implement certain tasks in the physical and logical/virtual world. When you spend so much of your time solving transactional / tactical problems, you tend to think in a transactional / tactical way.
 
You can even see that in OSS product designs. They’ve been designed for personas who solve transactional problems (eg alarms, activations, etc). That’s important. It’s the coal-face that gets stuff done.
 
But who funds OSS projects? Are their personas thinking at a tactical level? Perhaps, but I suspect not on a full-time basis. Their thoughts might dive to a tactical level when there are outages or poor performance, but they’ll tend to be thinking more about strategy, risk mitigation and efficiency if/when they can get out of the tactical distractions.
 
Do our OSS meet project sponsor needs? Do our OSS provide functionality that helps manage strategy, risk and efficiency? Well, our OSS can help with reports and dashboards that help them. But do reports and dashboards inspire them enough to invest millions? Could sponsors rightly ask, “I’m spending money, but what’s in it for me?”
 
What if we tasked our product teams to think in terms of business objectives instead of transactions? The objectives may include rolled-up transaction-based data and other metrics of course. But traditional metrics and activities are just a means to an end.
 
You’re probably thinking that there’s no way you can retrofit “objective design” into products that were designed years ago with transactions in mind. You’d be completely correct in most cases. So what’s the solution if you don’t have retrofit control over your products?
 
Well, there’s a class of OSS products that I refer to as being “the data bridge.” I’ll dive into more detail on these currently rare products tomorrow.

An OSS checksum

Yesterday’s post discussed two waves of decisions stemming from our increasing obsession with data collection.

“…the first wave had [arisen] because we’d almost all prefer to make data-driven decisions (ie decisions based on “proof”) rather than “gut-feel” decisions.

We’re increasingly seeing a second wave come through – to use data not just to identify trends and guide our decisions, but to drive automated actions.”

Unfortunately, the second wave has an even greater need for data correctness / quality than we’ve experienced before.

The first wave allowed for human intervention after the collection of data. That meant human logic could be applied to any unexpected anomalies that appeared.

With the second wave, we don’t have that luxury. It’s all processed by the automation. Even learning algorithms struggle with “dirty data.” Therefore, the data needs to be perfect and the automation’s algorithm needs to flawlessly cope with all expected and unexpected data sets.

Our OSS have always had a dependence on data quality so we’ve responded with sophisticated ways of reconciling and maintaining data. But the human logic buffer afforded a “less than perfect” starting point, as long as we sought to get ever-closer to the “perfection” asymptote.

Does wave 2 require us to solve the problem from a fundamentally different starting point? We have to assume perfection akin to a checksum of correctness.

Perfection isn’t something I’m very qualified at, so I’m open to hearing your ideas. 😉

 

Riffing with your OSS

Data collection and data science are becoming big business. Not just in telco – our OSS have always been one of the biggest data gatherers around – but across all sectors that are increasingly digitising (should I just say, “all sectors” because they’re all digitising?).

Why do you think we’re so keen to collect so much data?

I’m assuming that the first wave had mainly been because we’d almost all prefer to make data-driven decisions (ie decisions based on “proof”) rather than “gut-feel” decisions.

We’re increasingly seeing a second wave come through – to use data not just to identify trends and guide our decisions, but to drive automated actions.

I wonder whether this has the potential to buffer us from making key insights / observations about the business, especially senior leaders who don’t have the time to “science” their data? Have teams already cleansed, manipulated, aggregated and presented data, thus stripping out all the nuances before senior leaders ever even see your data?

I regretfully don’t get to “play” with data as much as I used to. I say regretfully because looking at raw data sets often gives you the opportunity to identify trends, outliers, anomalies and patterns that might otherwise remain hidden. Raw data also gives you the opportunity to riff off it – to observe and then ask different questions of the data.

How about you? Do you still get the opportunity to observe and hypothesise using raw OSS/BSS data? Or do you make your decisions using data that’s already been sanitised (eg executive dashboards / reports)?

 

OSS diamonds are forever (part 2)

Wednesday’s post discussed how OPEX is forever, just like the slogan for diamonds.
 
As discussed, some aspects of Operational Expenses are well known when kicking off a new OSS project (eg annual OSS license / support costs). Others can slip through the cracks – what I referred to as OPEX leakage (eg third-party software, ongoing maintenance of software customisations).
 
OPEX leakage might be an unfair phrase. If there’s a clear line of sight from the expenses to a profitable return, then it’s not leakage. If costs (of data, re-work, cloud services, applications, etc) are proliferating with no clear benefit, then the term “leakage” is probably fair.
 
I’ve seen examples of Agile and cloud implementation strategies where leakage has occurred. And even the supposedly “cheap” open-source strategies have led to surprises. OPEX leakage has caused project teams to scramble as their financial year progressed and budgets were unexpectedly being exceeded.
 
Oh, and one other observation to share that you may’ve seen examples of, particularly if you’ve worked on OSS in large organisations – Having OPEX incurred by one business unit but the benefit derived by different business units. This can cause significant problems for the people responsible for divisional budgets, even if it’s good for the business as a whole. 
 
Let me explain by example: An operations delivery team needs extra logging capability so they stand up a new open-source tool. They make customisations so that log data can be collected for all of their network types. All log data is then sent to the organisation’s cloud instance. The operations delivery team now owns lifecycle maintenance costs. However, the cost of cloud (compute and storage) and data lake licensing have now escalated but Operations doesn’t foot that bill. They’ve just handed that “forever” budgetary burden to another business unit.
 
The opposite can also be true. The costs of build and maintain might be borne by IT or ops, but the benefits in revenue or CX (customer experience) are gladly accepted by business-facing units.
 
Both types of project could give significant whole-of-company benefit. But the unit doing the funding will tend to choose projects that are less effective if it means their own business unit will derive benefit (especially if individuals’ bonuses are tied to those results).
 
OSS can be powerful tools, giving and receiving benefit from many different business units. However, the more OPEX-centric OSS projects that we see today are introducing new challenges to get funded and then supported across their whole life-cycle.
 
PS. Just like diamonds bought at retail prices, there’s a risk that the financials won’t look so great a year after purchase. If that’s the case, you may have to seek justification on intangible benefits.  😉
 
PS2. Check out Robert’s insightful comment to the initial post, including the following question, “I wonder how many OSS procurements are justified on the basis of reducing the Opex only *of the current OSS*, rather than reducing the cost of achieving what the original OSS was created to do? The former is much easier to procure (but may have less benefit to the business). The latter is harder (more difficult analysis to do and change to manage, but payoff potentially much larger).”

Diamonds are Forever and so is OSS OPEX

Sourced from: www.couponraja.in

I sometimes wonder whether OPEX is underestimated when considering OSS investments, or at least some facets (sorry, awful pun there!) of it.

Cost-out (aka head-count reduction) seems to be the most prominent OSS business case justification lever. So that’s clearly not underestimated. And the move to cloud is also an OPEX play in most cases, so it’s front of mind during the procurement process too. I’m nought for two so far! Hopefully the next examples are a little more persuasive!

Large transformation projects tend to have a focus on the up-front cost of the project, rightly so. There’s also an awareness of ongoing license costs (usually 20-25% of OSS software list price per annum). Less apparent costs can be found in the exclusions / omissions. This is where third-party OPEX costs (eg database licenses, virtualisation, compute / storage, etc) can be (not) found.

That’s why you should definitely consider preparing a TCO (Total Cost of Ownership) model that includes CAPEX and OPEX that’s normalised across all options when making a buying decision.
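
A minimal sketch of such a model follows. It compares two hypothetical options over the same five-year horizon so that CAPEX and recurring OPEX are compared on a like-for-like basis. Every figure, option name and cost category is invented for illustration. The 22% support rate is simply an assumed value inside the 20-25% range mentioned above, and a real model would also cover items like discounting, customisation and upgrade costs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Option:
    """One candidate solution; all figures are made-up placeholders."""
    name: str
    capex: float                 # up-front project and licence cost
    list_price: float            # software list price that support fees are based on
    support_rate: float = 0.22   # assumed annual support fee as a fraction of list price
    other_opex: Dict[str, float] = field(default_factory=dict)  # third-party costs per year


def total_cost_of_ownership(option: Option, years: int) -> float:
    """CAPEX plus every recurring OPEX line, summed over the same horizon."""
    annual_opex = option.list_price * option.support_rate + sum(option.other_opex.values())
    return option.capex + annual_opex * years


options = [
    Option("Option A", capex=2_000_000, list_price=1_000_000,
           other_opex={"database licences": 50_000, "compute / storage": 80_000}),
    Option("Option B", capex=1_200_000, list_price=1_500_000,
           other_opex={"database licences": 120_000, "compute / storage": 150_000}),
]

# Rank the options by five-year TCO rather than by up-front cost alone.
for opt in sorted(options, key=lambda o: total_cost_of_ownership(o, years=5)):
    print(f"{opt.name}: {total_cost_of_ownership(opt, years=5):,.0f}")
```

With this made-up data, the option with the lower up-front cost ends up being the more expensive one over five years, which is exactly the kind of surprise the exclusions / omissions tend to hide.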

But the more subtle OPEX leakage occurs through customisation. The more customisation from “off-the-shelf” capability, the greater the variation from baseline, the larger the ongoing costs of maintenance and upgrade. This is not just on proprietary / commercial software, but open-source products as well.

And choosing Agile almost implies ongoing customisation. One of the things about Agile is it keeps adding stuff (apps, data, functions, processes, code, etc) via OPEX. It’s stack-ranked, so it’s always the most important stuff (in theory). But because it’s incremental, it tends to be less closely scrutinised than during a CAPEX / procurement event. Unless carefully monitored, there’s a greater chance for OPEX leakage to occur.

And as we know about OPEX costs, like diamonds, they’re forever (ie the costs re-appear year after year).

A billion dollar bid

A few years ago I was lucky enough to be invited to lead a bid. I say lucky because the partner organisations are two of the most iconic firms in the tech industry. The bid was for bleeding-edge work, potentially worth well over a billion dollars. I was a little surprised to be honest. I mean, two tech titans, with many very, very clever people, much cleverer than me. Why would they need to look outside and engage me?

As it turned out, the answer became clear within the first few meetings. And whilst the project had little to do with OSS, it certainly had (has) parallels in the world of OSS.

Both of the organisations were highly siloed. Each product / capability silo had immense talent and immense depth to it. Our combined team had many PhDs who could discuss their own silo for hours, but could only point me in the general direction of what plugged into their products. 

Clearly, I was engaged to figure out the required end-to-end solution for the customer and then how to bolt the two sets of silos into that solution framework.

The same is true when looking for OSS solution gaps, in my experience at least. If you look into a domain or a product, the functionality / capability is usually quite well defined, understood and supported. For example, alarm / event managers are invariably very good at managing alarm / event lists.

If you’re going to find gaps, they’re more likely to be found in the end-to-end solution – in the handoffs, responsibility demarcation points, interfaces and processes that cross between silos. That’s why external consultancies can prove valuable for large organisations. They generally look into the cross-domain solution performance.

As you’d already know, the end-to-end solution is a combination of people, process and technology. Even so, as the “manager of managers,” I’m not sure our OSS tech is solving this problem as well as it could. Is there even a “glue” product that’s missing from our OSS/BSS stack?

Sure, we have some tools that fit this purpose – workflow engines, messaging buses, orchestration engines, data lakes, etc. Yet I still feel there’s an opportunity to do it far better. And the opportunity probably extends far beyond just OSS and into the broader IT industry.

What have you done to help solve this problem on your OSS suites?

PS. If you’re wondering what happened to the bid. Well, the team was excited to have made the shortlist of 3, but then the behemoths decided to withdraw from the race. Turns out that winning the bid could’ve jeopardised the even bigger supply contracts they already had with the client. Boggles the mind to think there were bigger contracts already in play!!

 

Inventory Management re-states its case

In a post last week we posed the question of whether Inventory Management still retains relevance. There are certainly use cases where it remains unquestionably needed. But perhaps there are others where it’s no longer required, a relic of old-school processes and data flows.
 
If you have an extensive OSP (Outside Plant) network, you have almost no option but to store all this passive infrastructure in an Inventory Management solution. You don’t have the option of having an EMS (Element Management System) console / API to tell you the current design/location/status of the network. 
 
In the modern world of ubiquitous connection and overlay / virtual networks, Inventory Management might be less essential than it once was. For service qualification, provisioning and perhaps even capacity planning, everything you need to know is available on demand from the EMS/s. The network is a more correct version of the network inventory than an external repository (ie Inventory Management) can hope to be, even if you have great success with synchronisation.
 
But I have a couple of other new-age use-cases to share with you where Inventory Management still retains relevance.
 
One is for connectivity (okay so this isn’t exactly a new-age use-case, but the scenario I’m about to describe is). If we have a modern overlay / virtual network, anything that stays within a domain is likely to be better served by its EMS equivalent. Especially since connectivity is no longer as simple as physical connections or nearest neighbours with advanced routing protocols. But anything that goes cross-domain and/or off-net needs a mechanism to correlate, coordinate and connect. That’s the role the Inventory Manager is able to do (conceptually).
 
The other is for digital twinning. OSS (including Inventory Management) was the “original twin.” It was an offline mimic of the production network. But I cite Inventory Management as having a new-age requirement for the digital twin. I increasingly foresee the need for predictive scenarios to be modelled outside the production network (ie in the twin!). We want to try failure / degradation scenarios. We want to optimise our allocation of capital. We want to simulate and optimise customer experience under different network states and loads. We’re beginning to see the compute power that’s able to drive these scenarios (and more) at scale.
 
Is it possible to handle these without an Inventory Manager (or equivalent)?

When OSS experts are wrong

“When experts are wrong, it’s often because they’re experts on an earlier version of the world.”
Paul Graham.
 
OSS experts are often wrong. Not only because of the “earlier version of the world” paradigm mentioned above, but also the “parallel worlds” paradigm that’s not explicitly mentioned. That is, they may be experts on one organisation’s OSS (possibly from spending years working on it), but have relatively little transferable expertise on other OSS.
 
It would be nice if the OSS world view never changed and we could just get more and more expert at it, approaching an asymptote of expertise. Alas, it’s never going to be like that. Instead, we experience a world that’s changing across some of our most fundamental building blocks.
 
“We are the sum total of our experiences.”
B.J. Neblett.
 
My earliest forays into OSS had a heavy focus on inventory. The tie-in between services, logical and physical inventory (and all use-cases around it) was probably core to me becoming passionate about OSS. I might even go as far as saying I’m “an Inventory guy.”
 
Those early forays occurred when there was a scarcity mindset in network resources. You provisioned what you needed and only expanded capacity within tight CAPEX envelopes. Managing inventory and optimising revenue using these scarce resources was important. We did that with the help of Inventory Management (IM) tools. Even end-users had a mindset of resource scarcity. 
 
But the world has changed. We now operate with a cloud-inspired abundance mindset. We over-provision physical resources so that we can just spin up logical / virtual resources whenever we wish. We have meshed, packet-switched networks rather than nailed up circuits. Generally speaking, cost per resource has fallen dramatically so we now buy a much higher port density, compute capacity, dollar per bit, etc. Customers of the cloud generation assume abundance of capacity that is even available in small consumption-based increments. In many parts of the world we can also assume ubiquitous connectivity.
 
So, as “an inventory guy,” I have to question whether the scarcity to abundance transformation might even fundamentally change my world-view on inventory management. Do I even need an inventory management solution or should I just ask the network for resources when I want to turn on new customers and assume the capacity team has ensured there’s surplus to call upon?
 
Is the enormous expense we allocate to building and reconciling a digital twin of the network (ie the data gathered and used by Inventory Management) justified? Could we circumvent many of the fallouts (and a multitude of other problems) that occur because the inventory data doesn’t accurately reflect the real network?
 
For example, in the old days I always loved how much easier it was to provision a customer’s mobile / cellular or IN (Intelligent Network) service than a fixed-line service. It was easier because fixed-line service needed a whole lot more inventory allocation and reservation logic and process. Mobile / IN services didn’t rely on inventory, only an availability of capacity (mostly). Perhaps the day has almost come where all services are that easy to provision?
 
Yes, we continue to need asset management and capacity planning. Yes, we still need inventory management for physical plant that has no programmatic interface (eg cables, patch-panels, joints, etc). Yes, we still need to carefully control the capacity build-out to CAPEX to revenue balance (even more so now in a lower-profitability operator environment). But do many of the other traditional Inventory Management and resource provisioning use cases go away in a world of abundance?
 

 

I’d love to hear your opinions, especially from all you other “inventory guys” (and gals)!! Are your world-views, expertise and experiences changing along these lines too or does the world remain unchanged from your viewing point?
 
Hat tip to Garry for the seed of this post!

Google’s Circular Economy in OSS

OSS wear many hats and help many different functions within an organisation. One function that OSS assists might be surprising to some people – the CFO / Accounting function.

The traditional service provider business model tends to be CAPEX-heavy, with significant investment required on physical infrastructure. Since assets need to be depreciated and life-cycle managed, Accountants have an interest in the infrastructure that our OSS manage via Inventory Management (IM) tools.

I’ve been lucky enough to work with many network operators and see vastly different asset management approaches used by CFOs. These strategies have ranged from fastidious replacement of equipment as soon as depreciation cycles have expired through to building networks using refurbished equipment that has already passed manufacturer End-of-Life dates. These strategies fundamentally affect the business models of these operators.

Given that telecommunications operator revenues are trending lower globally, I feel it’s incumbent on us to use our OSS to deliver positive outcomes to global business models. 

With this in mind, I found this article entitled, “Circular Economy at Work in Google Data Centers,” to be quite interesting. It cites, “Google’s circular approach to optimizing end of life of servers based on Total Cost of Ownership (TCO) principles have resulted in hundreds of millions per year in cost avoidance.”

Google Asset Lifecycle

Asset lifecycle management is not your typical focus area for OSS experts, but an area where we can help add significant value for our customers!

Some operators use dedicated asset management tools such as SAP. Others use OSS IM tools. Others reconcile between both. There’s no single right answer.

For a deeper dive into ideas where our OSS can help in asset lifecycle (which Google describes as its Circular Economy and seems to manage using its ReSOLVE tool), I really recommend reviewing the article link above.

If you need to develop such a tool using machine learning models, reach out to us and we’ll point you towards some tools equivalent to ReSOLVE to augment your OSS.

Another OSS “forehead-slap” moment!

I don’t know about you, but I find this industry of ours has a remarkable ability to keep us humble. Barely a day goes by when I don’t have to slap my forehead and say, “uhhh…. of course!” (or perhaps, “D’oh!!”)

I had one such instance yesterday. I couldn’t figure out why a client’s telemetry / performance-management suite needed an inventory ingestion interface. Can you think of a reason (you probably can)???

My mind had followed the line of thinking that it was for reconciling with traditional inventory systems or perhaps some sort of topology reckoning. It’s far more rudimentary than that. 

Have you figured out what it might be used for yet?

Enrichment!

For example, if device names (hostnames) attached to the metrics aren’t human-readable, simple, just enrich the data with its human-readable alternate name. If you don’t know what device type is generating sub-sets of metrics, no problems, just enrich the data.
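
A minimal sketch of that enrichment step, assuming the inventory arrives as a simple lookup keyed by hostname. The hostnames, field names and the enrich function are hypothetical and not taken from any particular performance-management product.

```python
# Hypothetical inventory extract, keyed by the hostname each device reports.
inventory = {
    "rtr-0a1f": {"friendly_name": "Sydney Core Router 1", "device_type": "core-router"},
    "sw-77c2": {"friendly_name": "Melbourne Access Switch 12", "device_type": "access-switch"},
}


def enrich(metric: dict, inventory: dict) -> dict:
    """Return a copy of the metric with inventory attributes added.

    Metrics from unknown hostnames are passed through and flagged,
    so gaps in the inventory feed stay visible rather than hidden.
    """
    enriched = dict(metric)  # copy so the raw metric is left untouched
    record = inventory.get(metric["hostname"])
    if record is None:
        enriched["inventory_match"] = False
        return enriched
    enriched.update(record)
    enriched["inventory_match"] = True
    return enriched


# Made-up metric samples: one known device and one missing from inventory.
metrics = [
    {"hostname": "rtr-0a1f", "metric": "cpu_util_pct", "value": 71.5},
    {"hostname": "unknown-9", "metric": "cpu_util_pct", "value": 12.0},
]
enriched_metrics = [enrich(m, inventory) for m in metrics]
for m in enriched_metrics:
    print(m)
```

Flagging unmatched hostnames instead of dropping them is a design choice in this sketch. It keeps the metric stream complete while showing where the inventory ingestion has gaps.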

I’d heard of enrichment of alarms/events of course, but hadn’t followed that line of thinking for performance management before. Does your performance management stack allow you to enrich its data sets?

Seems obvious in hindsight! Smacked down again!!

I’d love to hear any anecdotes you have where OSS gave you a “forehead slap” moment.

Over 30 Autonomous Networking User Stories

The following is a set of user stories I’ve provided to TM Forum to help with their current Autonomous Networking initiative.

They’re just an initial discussion point for others to riff off. We’d love to get your comments, additions and recommended refinements too.

As a Head of Network Operations, I want to Automatically maintain the health of my network (within expected tolerances if necessary) So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Ensure the overall solution is designed with machine-led automations as a guiding principle So that Human intervention can not be easily engineered into the systems/processes
As a Head of Network Operations, I want to Automatically identify any failures of resources or services within the entire network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Automatically identify any degradation of resource or service performance within the network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Map each codified data set (for failure or degradation cases) to a remedial action plan So that Remedial activities can be initiated without human interaction
As a Head of Network Operations, I want to Identify which remedial activities can be initiated via a programmatic interface and which activities require manual involvement such as a truck roll So that Even manual activities can be automatically initiated
As a Head of Network Operations, I want to Ensure that automations are able to resolve all known failure / degradation scenarios So that Activities can be initiated for any failure or degradation and be automatically resolved through to closure (with little or no human intervention)
As a Head of Network Operations, I want to Ensure there is sufficient network resilience So that Any failure or degradation can be automatically bypassed (temporarily or permanently)
As a Head of Network Operations, I want to Ensure there is sufficient resilience within all support systems So that Any failure or degradation can be automatically bypassed (temporarily or permanently) to ensure customer service is maintained
As a Head of Network Operations, I want to Ensure that operator initiated changes (eg planned maintenance, software upgrades, etc) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure there is minimal or no impact to customer services, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Ensure that customer initiated changes (eg by raising an incident) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure the incident is closed expediently, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Initiate planned outages with or without triggering automated remedial activities So that The change agents can decide to use automations or not and ensure automations don’t adversely affect the activities that are scheduled for the planned outage window
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur, impacted customers are automatically notified (on first instance and via a communications sequence if necessary throughout the outage window) So that Customer experience can be managed as best possible
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur without a remedial action being triggered, a post-mortem analysis is initiated So that Automations can be revised to cope with this previously unhandled outage scenario
As a Head of Network Operations, I want to Ensure that even previously unseen failure scenarios can be handled by remedial automations So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Automatically monitor the effects of remedial actions So that Remedial automations don’t trigger race conditions that result in further degradation and/or downstream impacts
As a Head of Network Operations, I want to Be able to manually override any automations by following a documented sequence of events So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on a scheduled and/or randomised basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on an ad-hoc basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Perform scheduled compliance checks on the network So that Expected configurations and policies are in place across the network
As a Head of Network Operations, I want to Automatically generate scheduled reports relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Automatically generate dashboards (in near-real-time) relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Ensure that automations are able to extend across all domains within the solution So that Remedial actions aren’t constrained by system hand-offs
As a Head of Network Operations, I want to Ensure configuration backups are performed automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be stored as protection in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure configuration restores are performed and tested automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be reverted to in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure automations are able to manage the entire service lifecycle (add, modify/upgrade, suspend, restore, delete) So that Customer services can evolve to meet client expectations with little or no human intervention
As a Head of Network Operations, I want to Have a design and architecture that uses intent-based and/or policy-based actions So that The complexity of automations is minimised (eg automations don’t need to consider custom rules for different device makes/models, etc)
As a Head of Network Operations, I want to Ensure as many components of the solution (eg EMS, OSS, customer portals, etc) have programmatic interfaces (even if manual activities are required in back-end processes) So that Automations can initiate remedial actions in near real time
As a Head of Network Operations, I want to Ensure all components and data flows within the solution are securely hardened (eg encryption of data in motion and at rest) So that The power of the autonomous platform can not be leveraged for nefarious purposes
As a Head of Network Operations, I want to Ensure that all required metrics can be automatically sourced from the network / systems in as near real time as feasible / useful So that Automations have the full set of data they need to initiate remedial actions and it is as up-to-date as possible for precise decision-making
As a Head of Network Operations, I want to Use the power of learning machines So that The sophistication and speed of remedial response is faster, more accurate and more reliable than if manual interaction were used
As a Head of Network Operations, I want to Record actual event patterns and replay scenarios offline So that Event clusters and response patterns can be thoroughly tested as part of the certification process prior to being released into production environments
As a Head of Network Operations, I want to Capture metrics that can be cross-referenced against event patterns and remedial actions So that Regressions and/or refinements can improve existing automations (ie continuous retraining of the model)
As a Head of Network Operations, I want to Be able to seed a knowledge base with relevant event/action data, whether the pattern source is from Production, an offline environment, a digital twin environment or other production-like environments So that The database is able to identify real scenarios, even if scenarios are intentionally initiated, but could potentially cause network degradation to a production environment
As a Head of Network Operations, I want to Ensure that programmatic interfaces also allow for revert / rollback capabilities So that Remedial actions that aren’t beneficial can be rolled back to the previous state; OR other remedial actions are performed, allowing the automation to revert to original configuration / state
As a Head of Network Operations, I want to Be able to initiate circuit breakers to override any automations So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Manually or automatically generate response-plans (ie documented sequences of activities) for any remedial actions fed back into the system So that Internal (eg quality control) or external (eg regulatory) bodies can review “best-practice” remedial activities at any point in time
As a Head of Network Operations, I want to Intentionally trigger catastrophic network failures (in non-prod environments) So that We can trial many remedial actions until we find an optimal solution to seed the knowledge base with
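
To make the circuit-breaker and manual-override stories above a little more concrete, here is a minimal sketch of one way they might look. Everything in it is an assumption for illustration: the class name `AutomationCircuitBreaker`, the `max_actions` and `window_seconds` thresholds, and the idea of treating "too many remedial actions in a short window" as a proxy for a race condition.

```python
from collections import deque


class AutomationCircuitBreaker:
    """Blocks an automation that fires too often in a short window
    (a hypothetical proxy for a race condition). A human can also
    force the breaker open or reset it."""

    def __init__(self, max_actions=3, window_seconds=60.0):
        self.max_actions = max_actions        # assumed threshold, tune per automation
        self.window_seconds = window_seconds  # assumed look-back window
        self._recent = deque()                # timestamps of recent remedial actions
        self.open = False                     # open = automations are blocked

    def allow(self, now):
        """Return True if a remedial action may run at time `now` (seconds)."""
        if self.open:
            return False
        # Drop timestamps that have aged out of the look-back window.
        while self._recent and now - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions:
            self.open = True  # too many actions too quickly: trip the breaker
            return False
        self._recent.append(now)
        return True

    def manual_override(self):
        """Operator-initiated stop of the automation."""
        self.open = True

    def reset(self):
        """Close the breaker again once a human has reviewed the situation."""
        self._recent.clear()
        self.open = False
```

In a real stack the breaker would presumably wrap each automation's call into the network and log every trip, so that the post-mortem and response-plan stories have something to work from.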

H-OSS-ton, we have a problem

You’ve all probably seen this scene from the Tom Hanks movie, Apollo 13, right? But you’re probably wondering what it has to do with OSS?

Well, this scene came to mind when I was preparing a list of user stories required to facilitate Autonomous Networking.

More specifically, to the use-case where we want the Autonomous Network to quickly recover (as best it can) from unplanned catastrophic network failures.

Of course we don’t want catastrophic network failures in production environments, but if one does occur, we’d prefer that our learning machines already have some idea on how to respond to any unlikely situation. We don’t want them to be learning response mechanisms after a production event.

But similarly, we don’t want to trigger massive outages on production just to build up a knowledge base of possible cause-effect groupings. That would be ridiculous.

That’s where the Apollo 13 analogy comes into play:

  • The engineers on the ground (ie the non-prod environment) were tasked with finding a solution to the problem (as they said, “fitting a square peg in a round hole”)
  • The parts the Engineers were given matched the parts available in the spacecraft (ie non-prod and prod weren’t an exact match, but enough of a replica to be useful)
  • The Engineers were able to trial many combinations using the available parts until they found a workable resolution to the problem (even if it relied heavily on duct tape!)
  • Once the workable solution was found, it was codified (as a procedure manual) and transferred to the spacecraft (ie migrating seed data from non-prod to prod)

If I were responsible for building an Autonomous Network, I’d want to dream up as many failure scenarios as I could, initiate them in non-prod and then duct-tape* solutions together for them all… and then attempt to pre-seed those learnings into production.

* By “duct-tape” I mean letting the learning machine attempt to find optimal solutions by trialing different combinations of automated / programmatic and manual interventions.
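
As a rough sketch of that "trial many combinations in non-prod, then seed production" loop, something like the following could work. It's deliberately naive (an exhaustive search rather than a learning machine), and the function names, scenario names, action names and recovery scores are all made up for illustration.

```python
def seed_knowledge_base(failure_scenarios, candidate_actions, simulate):
    """Try every candidate action against every scenario in non-prod and
    keep the best-scoring action per scenario."""
    knowledge_base = {}
    for scenario in failure_scenarios:
        best_action, best_score = None, float("-inf")
        for action in candidate_actions:
            score = simulate(scenario, action)  # higher = better recovery
            if score > best_score:
                best_action, best_score = action, score
        knowledge_base[scenario] = {"action": best_action, "score": best_score}
    return knowledge_base


# Toy stand-in for a non-prod environment (hypothetical recovery scores).
TOY_RESULTS = {
    ("fibre_cut", "reroute_traffic"): 0.9,
    ("fibre_cut", "restart_device"): 0.1,
    ("power_loss", "reroute_traffic"): 0.4,
    ("power_loss", "failover_to_backup_site"): 0.8,
}


def toy_simulate(scenario, action):
    """Look up the made-up recovery score for a scenario/action pair."""
    return TOY_RESULTS.get((scenario, action), 0.0)


kb = seed_knowledge_base(
    ["fibre_cut", "power_loss"],
    ["reroute_traffic", "restart_device", "failover_to_backup_site"],
    toy_simulate,
)
```

The resulting `kb` dictionary plays the role of the procedure manual that gets sent up to the spacecraft: it's the seed data you'd migrate from non-prod to prod.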

We use time-stamping in OSS, but what about geo-stamping?

A slightly left-field thought dawned on me the other day and I’d like to hear your thoughts on it.

We all know that almost all telemetry coming out of our networks is time-stamped. Events, syslogs, metrics, etc. That makes perfect sense because we look for time-based ripple-out effects when trying to diagnose issues.

But does it therefore also make sense to geo-stamp telemetry data? Just as time-based ripple-out is common, so too are geographic / topological (eg nearest neighbour and/or power source) ripple-out effects.

If you want to present telemetry data as a geo/topo overlay, you currently have to enrich the telemetry data set first. Typically that means identifying the device name that’s generating the data and then doing a query on huge inventory databases to find the location and connectivity that corresponds to that device.

It’s usually not a complex query, but consider how much processing power must go into enriching at the enormous scale of telemetry records.

For stationary devices (eg core routers), it might seem a bit absurd adding a fixed geo-code (which has to be manually entered into the device once) to every telemetry record, but it seems computationally far more efficient than data lookups (please correct me if I’m wrong here!). For devices that move around (eg routers on planes), hopefully they already have GPS sensors to provide geo-stamp data.
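
To illustrate the difference between the two approaches, here's a small sketch. The inventory contents, field names and function names are all hypothetical, and it doesn't measure anything. It only shows where the per-record lookup disappears (at the cost of a slightly larger record) when the device stamps its own location.

```python
# Hypothetical inventory database: device name -> (latitude, longitude)
INVENTORY = {
    "core-rtr-01": (-37.8136, 144.9631),
    "core-rtr-02": (-33.8688, 151.2093),
}


def enrich_with_lookup(record, inventory):
    """Current approach: look up each record's device to find its location."""
    lat, lon = inventory[record["device"]]
    return {**record, "lat": lat, "lon": lon}


def emit_geo_stamped(device, lat, lon, timestamp, metric, value):
    """Geo-stamped approach: the device attaches its own location at source."""
    return {"device": device, "ts": timestamp, "metric": metric,
            "value": value, "lat": lat, "lon": lon}


# A telemetry record as it arrives today (time-stamped only).
raw = {"device": "core-rtr-01", "ts": "2020-01-01T00:00:00Z",
       "metric": "cpu_util", "value": 71.5}

enriched = enrich_with_lookup(raw, INVENTORY)
stamped = emit_geo_stamped("core-rtr-01", -37.8136, 144.9631,
                           "2020-01-01T00:00:00Z", "cpu_util", 71.5)
```

Both paths end up with the same record. The question is simply whether you pay for it with a lookup per record downstream, or with a few extra bytes per record at source.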

What do you think? Am I stating a problem that has already been solved and/or is not worth solving? Or does it have merit?

The Autonomous Network / OSS Clock

In yesterday’s post, we talked about what needs to happen for a network operator to build an autonomous network. Many of the factors extended beyond the direct control of the OSS stack. We also looked at the difference between designing network autonomy for an existing OSS versus a ground-up build of an autonomous network.

We mostly looked at the ground-up build yesterday (at the expense of legacy augmentation).

So let’s take a slightly closer look at legacy automation. Like any legacy situation, you need to first understand current state. I’ve heard colleagues discuss the level of maturity of an existing network operations stack in terms of a single metric.

However, I feel that this might miss some of the nuances of the situation. For example, different activities are likely to be at different levels of maturity. Hence, the attempt at benchmarking the current situation on the OSS or Autonomous Networking clock below.

OSS Autonomy Clock

Sample activities shown in grey boxes to demonstrate the concept (I haven’t invested enough time into what the actual breakdown of activities might be yet).

  • Midnight is no monitoring capability
  • 3AM is Reactive Mode (ie reacting to data presented by the network / systems)
  • 6AM is Predictive Mode (ie using historical learnings to identify future situations)
  • 9AM is Prescriptive / Pre-cognitive Mode (ie using historical learnings, or pre-cognitive capabilities to identify what to do next)
  • Mid-day is Autonomous Networking (ie to close the loop and implement / control actions that respond to current situations automatically)
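
If you wanted to capture this benchmark as data rather than a picture, a minimal sketch might look like the following. The enum name and the activities and ratings in `current_state` are placeholders I've invented to demonstrate the concept, not a real assessment.

```python
from enum import IntEnum


class AutonomyClock(IntEnum):
    """Positions on the clock, expressed as hours."""
    NO_MONITORING = 0   # midnight
    REACTIVE = 3        # 3AM
    PREDICTIVE = 6      # 6AM
    PRESCRIPTIVE = 9    # 9AM
    AUTONOMOUS = 12     # mid-day


# Hypothetical activities and ratings, for illustration only.
current_state = {
    "alarm_handling": AutonomyClock.REACTIVE,
    "capacity_planning": AutonomyClock.PREDICTIVE,
    "service_activation": AutonomyClock.AUTONOMOUS,
    "physical_inventory_audit": AutonomyClock.NO_MONITORING,
}


def least_mature(state):
    """Return the activities sitting at the earliest time on the clock."""
    lowest = min(state.values())
    return sorted(name for name, level in state.items() if level == lowest)
```

Rating each activity separately like this keeps the nuance that a single maturity metric would hide, while still letting you pick out the laggards with something like `least_mature(current_state)`.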

As always, I’d love to hear your thoughts!

As a network owner….

….I want to make my network so observable, reliable, predictable and repeatable that I don’t need anyone to operate it.

That’s clearly a highly ambitious goal. Probably even unachievable if we say it doesn’t need anyone to run it. But I wonder whether this has to be the starting point we take on behalf of our network operator customers?

If we look at most networks, OSS, BSS, NOC, SOC, etc (I’ll call this whole stack “the black box” in this article), they’ve been designed from the ground up to be human-driven. We’re now looking at ways to automate as many steps of operations as possible.

If we were to instead design the black-box to be machine-driven, how different would it look?

In fact, before we do that, perhaps we have to take two unique perspectives on this question:

  1. Retro-fitting existing black-boxes to increase their autonomy
  2. Designing brand new autonomous black-boxes

I suspect our approaches / architectures will be vastly different.

The first will require an incredibly complex measure, command and control engine to sit over the top of the existing black box. It will probably also need to reach into many of the components that make up the black box and exert control over them. This approach has many similarities with what we already do in the OSS world. The only exception would be that we’d need to be a lot more “closed-loop” in our thinking. I should also re-iterate that this is incredibly complex because it inherits an existing “decision tree” of enormous complexity and adds further convolution.

The second approach holds a great deal more promise. However, it will require a vastly different approach on many levels:

  1. We have to take a chainsaw to the decision tree inside the black box. For example:
    • We start by removing as much variability from the network as possible. Think of this like other utilities such as water or power. Our electricity service only has one feed-type for almost all residential and business customers. Yet it still allows us great flexibility in what we plug into it. What if a network operator were to simply offer a “broadband dial-tone” service and end users decided what they overlay on that bit-stream?
    • This reduces the “protocol stack” in the network (think of this in terms of the long list of features / tick-boxes on any router’s brochure)
    • As well as reducing network complexity, it drastically reduces the number of options an end-user needs to choose from. The operator no longer needs 50 grandfathered, legacy products
    • This also reduces the decision tree in BSS-related functionality like billing, rating, charging, clearing-house
    • We achieve a (globally?) standardised network services catalog that’s completely independent of vendor offerings
    • We achieve a more standardised set of telemetry data coming from the network
    • In turn, this drives a more standardised and minimal set of service-impact and root-cause analyses
  2. We design data input/output methods and interfaces (to the black box and to any of its constituent components) to have closed-loop immediacy in mind. At the moment we tend to have interfaces that allow us to interrogate the network and push changes into the network separately rather than tasking the network to keep itself within expected operational thresholds
  3. We allow networks to self-regulate and self-heal, not just within a node, but between neighbours without necessarily having to revert to centralised control mechanisms like OSS
  4. All components within the black-box, down to device level, are programmable. [As an aside, we need to consider how to make the physical network more programmable or reconcilable, considering that cables, (most) patch panels, joints, etc don’t have APIs. That’s why the physical network tends to give us the biggest data quality challenges, which ripples out into our ability to automate networks]
  5. End-to-end data flows (ie controls) are to be near-real-time, not constrained by processing lags (eg 15 minute poll cycles, hourly log processing cycles, etc) 
  6. Data minimalism engineering. It’s currently not uncommon for network devices to produce dozens, if not hundreds, of different metrics. Most are never used by operators manually, nor are likely to be used by learning machines. This increases data processing, distribution and storage overheads. If we only produce what is useful, then it should improve data flow times (point 5 above). Therefore learning machines should be able to control which data sets they need from network devices and at what cadence. The learning engine can start off collecting all metrics, then progressively turn them off as it deems them unnecessary. This could also extend to controlling log-levels (ie how much granularity of data is generated for a particular log, event, performance counter)
  7. Perhaps we even offer AI-as-a-service, whereby any of the components within the black-box can call upon a centralised AI service (and the common data lake that underpins it) to assist with localised self-healing, self-regulation, etc. This facilitates closed-loop decisions throughout the stack rather than just an over-arching command and control mechanism
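
As a small illustration of the data minimalism idea in point 6, here's a sketch of a learning engine that starts with every metric switched on and then prunes the ones it never uses. The function name, metric names, usage counts and the `min_uses` threshold are all assumptions for the sake of the example.

```python
def prune_metrics(enabled, usage_counts, min_uses=1):
    """Return the metrics to keep collecting: those the learning engine
    used at least `min_uses` times in the last review period."""
    return {m for m in enabled if usage_counts.get(m, 0) >= min_uses}


# Start by collecting everything the device can produce (hypothetical names).
enabled = {"cpu_util", "fan_speed", "if_in_errors", "chassis_temp"}

# How often each metric actually fed a decision in the review period.
usage = {"cpu_util": 42, "if_in_errors": 7, "fan_speed": 0}

# After the review, only the metrics that were used remain switched on.
enabled = prune_metrics(enabled, usage)
```

A real implementation would also need a way to switch a metric back on (eg when a new failure scenario shows it to be relevant), otherwise the engine can never rediscover data it has pruned.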

I’m barely exposing the tip of the iceberg here. I’d love to get your thoughts on what else it will take to bring fully autonomous networks to reality.

Net Simplicity Score (NSS) gets a little more complex

In last Tuesday’s post, I asked the community here on PAOSS and on TM Forum’s Engage platform for ideas about how you would benchmark complexity.

I also provided a reference to an old post that described the concept of a NSS (Net Simplicity Score) for our OSS/BSS.

Due to the complexity of factors that contribute to a complexity score, the NSS is a “catch-all” simplicity metric. Hopefully it will allow subtraction projects to be easily justified, just as the NPS (Net Promoter Score) metric has helped justify customer experience initiatives.

The NSS (Net Simplicity Score) could be further broken down into:

  • The NCSS (Net Customer Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) of how easy it is to choose and use the company / product / service. This is an external metric (ie the ranking of the level of difficulty that your customers face)
  • The NOSS (Net Operator Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) of how easy it is to choose and use the company / product / service. This is an internal metric (ie for operators to rank complexity of systems and their constituent applications / data / processes)
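
As a sketch of how these two scores might be calculated, the example below borrows the standard NPS arithmetic (the percentage of ratings of 9-10 minus the percentage of ratings of 0-6). Applying those same bands to simplicity ratings is my assumption here rather than a settled definition, and the sample ratings are invented.

```python
def net_score(ratings):
    """NPS-style net score: % of ratings 9-10 minus % of ratings 0-6."""
    if not ratings:
        raise ValueError("at least one rating is required")
    high = sum(1 for r in ratings if r >= 9)
    low = sum(1 for r in ratings if r <= 6)
    return 100.0 * (high - low) / len(ratings)


# Hypothetical 0-10 simplicity ratings gathered by survey.
customer_ratings = [9, 10, 7, 3, 8, 10, 6, 9]
operator_ratings = [4, 5, 9, 6, 7, 2]

ncss = net_score(customer_ratings)  # 25.0
noss = net_score(operator_ratings)  # -50.0
```

With these made-up numbers the customers find things reasonably simple while the operators don't, which is exactly the kind of gap that splitting the NSS in two is meant to expose.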

One interesting item of feedback came from Ronald Hasenberger. He rightly pointed out that just because something is simple for users to interact with, doesn’t mean it’s simple behind the scenes – often exactly the opposite. The iPod example I used in earlier posts is a case in point. The iPod was more intuitive than existing MP3 players, but a huge amount of design and engineering went into making it that way. The underlying “system” certainly wasn’t simple.

So perhaps there’s a third simplicity factor to add to the two bullets listed above:

  • The NSSS (Net System Simplicity Score) – and this one does require a more sophisticated algorithm than just an aggregate of perceptions. Not only that, but it’s the one that truly reflects the systems we design and build. I wonder whether the first two are an initial set of proxies that help drive complexity out of our solutions, but we need to develop Ronald’s third one to make the biggest impact?

Again, I’d love to hear your thoughts!

OSS are not just a #$%&ing cost centre

It seems that OSS/BSS are always an afterthought. And always seen as a cost centre rather than a revenue generator.

Now I’m biased of course, but I think that’s such a narrow view. And we need everyone in our industry to spread the same gospel. 

I like to think of it like this… Sales teams identify the customers and revenue (let’s call them THE BUY-SIDE). Network teams build the assets that service the customer needs (let’s call them THE SELL-SIDE). But the OSS/BSS are the profit engine because they bring buyers and sellers together.

They initiate revenues (Fulfillment / Activation workflows), they retain revenues (Assurance workflows) and they can identify, then minimise costs (automations, analytics, leakage management, identify ineffective work practices, identify simplification opportunities, workforce coordination, etc, etc). They also have a strong influence on customer experience.

OSS/BSS operationalise the assets (to deliver the services that customers pay for).

How much revenue do our OSS/BSS operationalise? All of it (unless some orders are for professional services and/or being activated directly on the network without touching OSS or BSS systems)

The OSS/BSS also provide the strategic levers for management to pull in future. In times when long-term competitive advantages are hard to find, your OSS/BSS can give significant competitive advantage (if flexible and effective) or hinder it (if inflexible / unadaptable)

Opinions wanted – How to Benchmark OSS/BSS complexity

I’d love to ask you an important question…  how do we benchmark OSS/BSS complexity? To measure how complex our systems are and therefore provide a signpost for simplification.

A colleague has opined that the number of apps in a stack could be used as a proxy. I can see where he’s going with that, but I feel that it doesn’t account for architectural differences such as monolith versus microservices.

I’d love to hear your thoughts via the comments box below.

FWIW, here are some additional thoughts from me, but please don’t let them bias your opinions:

  • For me, complexity relates to the efficiency of getting tasks done:
    • How much time to complete certain tasks
    • How many button clicks
    • How much swivel-chairing
    • How many CPU cycles for automated tasks
    • How much admin overhead
    • How much duplicated effort and/or rework
  • However, there are so many different tasks done within an OSS/BSS stack that it’s difficult to provide a complexity metric that compares one OSS/BSS stack with another. Or compares a single stack before/after changes are made
  • In some cases the complexity happens inside the OSS/BSS “black box” (eg tools within the suite aren’t seamlessly integrated, causing operators to perform dual-entry that leads to data inconsistency and downstream re-work)
  • In other cases the complexity is inherited from outside the black box (eg product offerings have hundreds of possible variants that are imperceptibly different in the customer’s eyes). I call this The OSS Pyramid of Pain
  • In many cases, the complexity of an OSS/BSS stack is less about the systems and integrations, and more about the complexity of The Decision Tree that spans the stack. The spread of the Decision Tree is impacted by:
    • The OSS/BSS applications
    • Support applications (eg authentication, security, data management, resilience / availability, etc)
    • System interfaces (internal and external)
    • User interfaces
    • Process designs
    • Product definitions
    • Work practices
    • Data models
    • Design rules
    • Network topologies
    • etc, etc
  • The more complex the Decision Tree, the more complex it is to transform our OSS. It loosely aligns with what I call The Chessboard Analogy
  • The development strategy used also has an impact, be it monolithic, best-of-breed, hosted or in-house developed. For example an in-house-developed solution is likely to have less functionality-bloat than a COTS (off-the-shelf) solution. The COTS solution needs to include additional functionality to enable it to support requirements of multiple customers
  • And finally, a benchmark is only as useful as the actions that it triggers. How do we codify a complexity metric that has the equally complex array of contributions described above?
  • Perhaps we could take a somewhat abstracted approach like the NPS (Net Promoter Score) does, thus creating a NSS (Net Simplicity Score)

As mentioned above, I’d love to hear your thoughts on how we can benchmark the level of complexity in our OSS/BSS. Please leave your comments below.