Softwarisation of 5G

As you have undoubtedly noticed, 5G is generating quite a bit of buzz in telco and OSS circles.

For many it’s just the n+1 generation of mobile standards, where n is currently 4 (well, the number of recent introductions into the market means n is probably now getting closer to 5  🙂  ).

But 5G introduces some fairly big changes from an OSS perspective. As usual with network transformations / innovations, OSS/BSS are key to operationalising (ie monetising) the tech. This report from TM Forum suggests that more than 60% of revenues from 5G use-cases will be dependent on OSS/BSS transformation.

And this great image from the 5G PPP Architecture Working Group shows how the 5G architecture becomes a lot more software-driven than previous architectures. Interesting how all 5 “software dimensions” are the domain of our OSS/BSS, isn’t it? We could replace “5G architecture” with “OSS/BSS” in the diagram below and it wouldn’t feel out of place at all.

So, you may be wondering in what ways 5G will impact our OSS/BSS:

  • Network slicing – being able to carve up the network virtually, to generate network slices that are completely different functionally, means operators will be able to offer tailored, premium service offerings to different clients. This differs from the one-size-fits-all approach used previously. However, it also means OSS/BSS complexity increases significantly. It’s almost like you need an OSS/BSS stack for each network slice. Unless we can create massive operational efficiencies through automation, the cost to run the network will increase significantly. Definitely a no-no for the execs!!
  • Fibre deeper – since 5G will introduce increased cell density in many locations, and offer high throughput services, we’ll need to push fibre deeper into the network to support all those nano-cells, pico-cells, etc. That means an increased reliance on good outside plant (PNI – Physical Network Inventory) and workforce management (WFM) tools
  • Software defined networks, virtualisation and virtual infrastructure management (VIM) – since the networks become a lot more software-centric, that means there are more layers (and complexity) to manage.
  • Mobile Edge Compute (MEC) and virtualisation – 5G will help to serve use-cases that may need more compute at the edge of the radio network (ie base stations and cell sites). This means more cross-domain orchestration for our OSS/BSS to coordinate
  • And other use-cases where OSS/BSS will contribute including:
    • Multi-tenancy to support new business models
    • Programmability of disparate networks to create a homogenised solution (access, aggregation, core, mobile edge, satellite, IoT, cloud, etc)
    • Self-healing automations
    • Energy efficiency optimisation
    • Monitoring end-user experience
    • Zero-touch administration aspirations
    • Drone survey and augmented reality asset management
    • etc, etc

Fun times ahead for OSS transformations! I just hope we can keep up and allow the operator market to get everything it wants / needs from the possibilities of 5G.

An Asset Management / Inventory trick

Last week we discussed the nuances between Inventory, Asset and Config Management within an OSS stack. Each of these tools is designed to support functionality for different users / persona-groups. However, they also tend to have significant functional overlap. Chances are your organisation doesn’t have separate dedicated tools for each.

So today I’m going to share a trick I’ve used in the past when I’ve only had a PNI (Physical Network Inventory) system to work with, but needed to perform asset-management-style functionality.

Most inventory tools are great at storing the current state of a device that exists in a network. However, they don’t tend to be so great at an asset manager’s primary function – tracking the entire life-cycle of an asset from procurement to decommissioning and sparing / maintenance along the way.

Normally the PNI just records the locations of all the active network equipment – in buildings, exchanges, comms-huts, cabinets, etc. The trick I use is to create one or more additional locations to represent warehouses. They may (or may not) correspond to the physical locations of your real warehouses.

In almost all PNI systems, you have control over the status of a device (eg IN-SERVICE, etc). You can use this functionality to add statuses such as SPARE, UNDER REPAIR, etc and to switch a device between active network locations and the warehouse.

These status-change records give you the ability to pin-point the location of a given asset at any point in time. It also gives you trending stats, either as an individual device or as a cohort of devices (eg by make/model).

You can even build processes around it for check-in / check-out of the warehouse and maintenance scheduling.
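To make the trick concrete, here’s a minimal Python sketch of what those status-change records might look like and how they could be queried to pin-point an asset at a point in time. The device names, statuses and helper function are all invented for illustration – it isn’t tied to any particular PNI product.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class StatusChange:
    """One status-change record for a uniquely identified device."""
    serial_number: str      # unique identifier (eg make/model + serial number)
    status: str             # eg IN-SERVICE, SPARE, UNDER REPAIR
    location: str           # an active site, or the pseudo "warehouse" location
    timestamp: datetime

# A device's life-cycle trail, as it would accumulate in the PNI over time
trail = [
    StatusChange("RTR-0001", "SPARE",        "Warehouse-North", datetime(2019, 1, 10)),
    StatusChange("RTR-0001", "IN-SERVICE",   "Exchange-A",      datetime(2019, 2, 3)),
    StatusChange("RTR-0001", "UNDER REPAIR", "Warehouse-North", datetime(2019, 6, 21)),
    StatusChange("RTR-0001", "IN-SERVICE",   "Exchange-B",      datetime(2019, 7, 15)),
]

def location_at(trail, serial_number, when):
    """Pin-point where a given asset was (and its status) at any point in time."""
    history = [r for r in trail if r.serial_number == serial_number and r.timestamp <= when]
    return max(history, key=lambda r: r.timestamp) if history else None

record = location_at(trail, "RTR-0001", datetime(2019, 6, 30))
print(record.location, record.status)   # Warehouse-North UNDER REPAIR
```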

I should point out that this only works if your PNI allows you to uniquely identify a device (eg by make/model + serial number, or perhaps a unique naming convention instance). If your PNI device records only show the current function of a device (eg a naming convention like SiteA-Router-0001), then you might lose sight of the device’s trail when it moves through life-cycle states (eg to the warehouse).

Bleak sentiments

“People in a tough spot often focus on their own problems, when the answer usually lies in fixing someone else’s.”
Steve Schwarzman.

The telco industry is in a tough spot in many areas around the globe. Sadly, there were more stories of wholesale retrenchments here in Australia this week, including good friends. Revenues falling. Sentiment bleak.

Yet it’s not like most declining industries. Telecom services and remote communications are more in demand than ever before. It’s just that the dynamic has flipped. Previously there was a scarcity mindset around telco services. Now there’s an abundance (in many parts of the world). Connectivity is less of a problem now. Customers have stepped up the hierarchy of needs.

Maybe as industries, telco and OSS, we’re focusing on our own problems? The O in OSS, Operations, can trigger an inward-facing viewpoint (and retrenchments only exacerbate internalisation).

5G isn’t the solution if we’re just looking at it as a new, better model of connectivity. It might be if it can make us better at solving other people’s problems. I suspect OSS will have a big part to play in generating those solutions (or stymieing them!).

OSS discovers a network

Following yesterday’s post about OSS Inventory, I received another great follow-up question from another avid reader of the PAOSS blog:

“Interesting thoughts Ryan! In addition to ‘faults up’, perhaps there is a case also (obvious?) for ‘discovery up’ to capture ongoing non-planned changes? Wondering have you come across any sort of reconciliation / adaptive inventory patterns like this? Workflow based? Autonomous? (Going too far into chaos theory territory?)”

Yes, we did exactly that with the same tool discussed yesterday that I used back in 2000. In fact, a very clever dev and I got that company’s first-ever auto-discovery tool working on site (using a product supplied by head-office). Discovering the nodal elements (ie equipment, cards, ports) was fairly easy. Discovering the connectivity within a domain (we started with SDH) was tricky, but achievable. Auto-discovering cross-domain connectivity (ie DSL circuits through physical, SDH transit, ATM and logical connectivity onto the IP cloud) was much trickier, as we needed to find/make linking keys across different data sources.

It was definitely workflow-based with a routine-driven back-end. We didn’t just want anything that was discovered to be automatically stuffed into (or removed from) the database (think flapping ports or equipment going down temporarily). It could’ve been autonomous, but we introduced a manual step to approve any of the discoveries made by each automated discovery iteration.
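For illustration only, here’s a rough Python sketch of that “discover, stage, approve” pattern: discovered records are compared against the inventory and held as candidate changes until an operator approves them. The device names, attributes and approval rule are all made up.

```python
# Illustrative "discover, stage, approve" reconciliation sketch.
inventory = {"NE-SYD-01": {"cards": 4}, "NE-MEL-02": {"cards": 8}}

def discover():
    """Stand-in for a real auto-discovery run against EMS/NMS interfaces."""
    return {"NE-SYD-01": {"cards": 4}, "NE-MEL-02": {"cards": 6}, "NE-BNE-03": {"cards": 2}}

def reconcile(inventory, discovered):
    """Build a list of candidate changes rather than applying them directly."""
    candidates = []
    for name, attrs in discovered.items():
        if name not in inventory:
            candidates.append(("ADD", name, attrs))
        elif inventory[name] != attrs:
            candidates.append(("UPDATE", name, attrs))
    for name in inventory:
        if name not in discovered:
            candidates.append(("REMOVE", name, None))   # flapping port, or genuinely gone?
    return candidates

def apply_approved(inventory, candidates, approved_by_operator):
    """Only commit the changes an operator has ticked off."""
    for action, name, attrs in candidates:
        if not approved_by_operator(action, name):
            continue
        if action == "REMOVE":
            inventory.pop(name, None)
        else:
            inventory[name] = attrs

candidates = reconcile(inventory, discover())
# Example approval rule: accept adds/updates, hold removals for a human to confirm
apply_approved(inventory, candidates, lambda action, name: action != "REMOVE")
print(inventory)
```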

As you know, modern networks / EMS / VIM (resource managers) are much more discoverable. They need to be for modern orchestration and resilience techniques. I don’t think it would be quite so tricky to stitch circuits together now, as we’re no longer as circuit-oriented as we were back in 2000.

However, I’d be fascinated to hear from other readers how much of a problem they have trying to marry up different data sources for discovery purposes. I’d also love to hear whether they’re using fully autonomous discovery or using the manual intervention step for the same reason we were. I imagine most are automating, because orchestration plans just need to make use of whatever resources are being presented by the underlying resource managers in near-real-time.

PS. For those wondering what “discovery” is, it’s shown as the lower grey arrow in this diagram from “Orders down, Faults up”.

Discovery is the process that allows data to be passed from NMS/EMS/NEs (ie the network or resource managers) directly into the inventory management database. It should be a more reliable and expedient way of synchronising the inventory with the live network.

The reason for the upper grey arrow is that not all networks have APIs that can be “discovered.” Passive equipment like cable joints and patch-panels doesn’t have programmatic interfaces. Therefore we need to find other ways to get that data into the Inventory Manager.

In need of an OSS transformation translator

As OSS Architects, we have an array of elegant frameworks to call upon when designing our transformational journeys – from current state to a target state architecture.

For example, when providing data mapping, we have tools to prepare current and/or target-state data diagrams such as the following:

Source here.

These diagrams are really elegant and powerful for communicating with other data experts and delivery teams. It’s a data expert language.

Data experts are experts in the ETL (Extract, Transform, Load) process, but often have less expertise with the actual meaning and importance of the data sets themselves. For example, a data expert may know there’s a product offerings table, and that each offering has 23 associated attributes (eg bandwidth, SLA class, etc) available. But they may have less understanding of the 245 product types that are housed in the product’s data table, and even less awareness of the meanings of the thousands of product attributes. You need to be a subject matter expert (SME) to understand that detail about the data. In some cases, the SME might come from your client and hold far more tribal knowledge than you.

We often need other SMEs (the products expert in this case) to help us understand what has to happen with the data during transformation. What do we keep, what do we change, what do we discard, etc.

Just one problem – SMEs might not always speak the same language as the data experts.

As elegant as it is, the data relationships diagram above might not be the most intuitive format for product experts to review and comment on.

As with many aspects of Architecture and transformation, if we’re to be understood, it’s best to communicate in our audience’s language.

In this case, it might be best to show data mappings as overlays on screenshots that the Product owner is familiar with:

  • From
    • Their current GUI
    • Existing sales order forms
    • Current report templates
  • To
    • Their next-generation GUI
    • New order forms
    • Post-Transform report templates

Such an approach might not look elegant to our data expert colleagues. The question is whether it quickly makes enough sense to the SMEs for you to elicit concise responses from them.
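For what it’s worth, the underlying mapping content needn’t change between the two renderings. Here’s a minimal sketch (in Python, with invented field names) of mapping records that could drive either an entity-relationship diagram for the data experts or callout overlays on familiar screens and order forms for the product SME.

```python
# One source-to-target field mapping, recorded as plain data.
# The same records can drive an ER-style diagram for data experts,
# or callouts overlaid on GUI screenshots / order forms for the product SME.
product_mapping = [
    {"source": "LEGACY_PROD.BW_MBPS",  "target": "ProductOffer.bandwidth", "rule": "copy"},
    {"source": "LEGACY_PROD.SLA_CODE", "target": "ProductOffer.slaClass",  "rule": "lookup SLA table"},
    {"source": "LEGACY_PROD.OLD_NAME", "target": None,                     "rule": "discard (confirm with SME)"},
]

for m in product_mapping:
    print(f"{m['source']:>22}  ->  {m['target'] or '(dropped)'}   [{m['rule']}]")
```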

The “right” approach is not always the most effective.

I’d love to hear your tips, tricks and recommendations for speaking / listening in the audience’s language.

Orders down, faults up

As mentioned in a post about Service and Resource Availability last week, I do tend to think of OSS workflows around an “orders down, faults up,” flow direction. And that means customers (services) at the top, network (resources) at the bottom of the (TMN) pyramid.

I also think of inventory (yellow) as the point where Assurance / Faults (blue) and Fulfillment / Orders (purple) collide and enhance, as per the diagram below.

These are highly generic examples, but let’s take a closer look:

Assurance flow (blue) – an alarm or event in the network (NEL/NE layer) pushes up through the stack to the OSS as a fault. The inventory (network / service) helps to enrich the fault with additional information (eg device name, location, connectivity, correlation of this and other faults, etc) to help resolve the fault (either manually by operators or automatically by algorithms). It also helps associate the fault in the network/resource with the customer/s using those resources. This allows notifications to be issued to customers. Note that this simple flow doesn’t reflect examples such as an incident (ie when a customer notices a problem first and calls it in before the OSS has been able to issue a notification).

Fulfillment flow (purple) – a customer places an order (BML/BSS layer or above) and it pushes down through the stack, including changes in the network (NEL/NE layer). Once all the appropriate network changes have been made, the order is ready for use by the customer. Once again, inventory plays an important part, associating customer / service identifiers with suitable resources from the available resource pool. Generally the (customer facing) service orders won’t have the technology-specific details required to actually update the network configurations though. That’s where the inventory often helps to fill in the knowledge gaps and send technology-specific commands down into the network. [See Friday’s post for more information about CFS and RFS definitions and mappings]

Inventory flows (yellow) – an inventory is relevant to assurance and fulfillment flows when the BSS and network / resource layers don’t hold enough information for those flows to be fully processed. The enriching information stored by inventory must come from somewhere. Some of it comes from Discovery (usually an automated process of collecting from the network or other sources), or via Manual / Scripted Input (eg physical network designs including patch cables and splicing). Some data (eg splices) just can’t be collected automatically as it relates to passive equipment that has no programmatic interface. This data just has to be created manually.
But arguably the more important inventory data is the record of mappings made from customers (services) to network (resources). Inventory solutions are often where these linking keys / relationships are recorded.
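As a highly simplified illustration of the enrichment described in the assurance flow above, the inventory lookup might look something like this. All device, service and customer names below are fictitious – it’s a sketch of the concept, not any particular product’s data model.

```python
# Minimal illustration of inventory enriching a raw fault before it reaches operators.
inventory = {
    "ne-eq-0042": {
        "name": "SYD-EXCH-A / Shelf 3",
        "location": "Sydney Exchange A",
        "services": ["SVC-1001", "SVC-1002"],   # customer-facing services on this resource
    }
}

customers = {"SVC-1001": "Acme Corp", "SVC-1002": "Bravo Pty Ltd"}

def enrich_fault(raw_alarm):
    """Attach device, location and impacted-customer context to a raw network alarm."""
    resource = inventory.get(raw_alarm["resource_id"], {})
    impacted = [customers[s] for s in resource.get("services", []) if s in customers]
    return {**raw_alarm,
            "device_name": resource.get("name", "unknown"),
            "location": resource.get("location", "unknown"),
            "impacted_customers": impacted}

alarm = {"resource_id": "ne-eq-0042", "severity": "CRITICAL", "event": "LOS"}
print(enrich_fault(alarm))
```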

These flows also tend to indicate the direction of data mastery. Whilst the network itself is the source of truth, Fulfillment flows will start at the BSS, and customer / service / order data will tend to be mastered there before orders are pushed into the network as provisioning commands. For assurance flows, the network will tend to be the master source of data, but with enrichment along its path northbound.

Just keep in mind that there are many exceptions to these examples. Data and processes can flow in many different ways. The diagram above is just useful for helping newcomers to understand some conceptual processes and data models / flows.

The Ineffective OSS Scoreboard Analogy

Imagine for a moment that you’re the coach of a sporting team. You train your team and provide them with a strategy for the game. You send them out onto the court and let them play.

The scoreboard gives you all of the stats about each player. Their points, blocks, tackles, heart-rate, distance covered, errors, etc. But it doesn’t show the total score for each team or the time remaining in the game. 

That’s exactly what most OSS reports and dashboards are like! You receive all of the transactional data (eg alarms, truck-rolls, device performance metrics, etc), but not how you’re collectively tracking towards team objectives (eg growth targets, risk reduction, etc). 

Yes, you could infer whether the team is doing well by reverse engineering the transactional data. Yes, you could then apply strategies against those inferences in the hope that it has a positive impact. But that’s a whole lot of messing around in the chaos of the coach’s box with the scores close (you assume) and the game nearing the end (possibly). You don’t really know when the optimal time is to switch your best players back into the game.

As coach with funding available, would you be asking your support team to give you more transactional tools / data or the objective-based insights?

Does this analogy help articulate the message from the previous two posts (Wed and Thurs)?

PS. What if you wanted to build a coach-bot to replace yourself in the near future? Are you going to build automations that close the feedback loop against transactional data or are you going to be providing feedback that pulls many levers to optimise team objectives?

One big requirement category most OSS can’t meet

We talked yesterday about a range of OSS products that are more outcome-driven than our typically transactional OSS tools. There aren’t many of them around at this stage. I refer to them as “data bridge” products.
 
Our typical OSS tools help manage transactions (eg alarms, customer service activations, etc). They’re generally not so great at (directly) managing objectives such as:
  • Sign up an extra 50,000 customers along the new Southern network corridor this month
  • Optimise allocation of our $10M capital budget to improve average attainable speeds by 20% this financial year
  • Achieve 5% revenue growth in Q3
  • Reduce truck rolls by 10% in the next 6 months
  • Optimal management of the many factors that contribute to churn, thus reducing churn risk by 7% by next March
 
We provide tools to activate the extra 50,000 customers. We also provide reports / dashboards that visualise the numbers of activations. But we don’t tend to include the tools to manage ongoing modelling and option analysis to meet key objectives. Objectives that are generally quantitative and tied to time, cost, etc and possibly locations/regions. 
 
These objectives are often really difficult to model and have multiple inputs. Managing to them requires data that’s changing on a daily basis (or potentially even more often – think of how a single missed truck-roll ripples out through re-calculation of optimal workforce allocation).
 
That requires:
  • Access to data feeds from multiple sources (eg existing OSS, BSS and other sources like data lakes)
  • Near real-time data sets (or at least streaming or regularly updating data feeds)
  • An ability to quickly prepare and compare options (data modelling, possibly using machine-based learning algorithms)
  • Advanced visualisations (by geography, time, budget drawdown and any graph types you can think of)
  • Flexibility in what can be visualised and how it’s presented
  • Methods for delivering closed-loop feedback to optimise towards the objectives (eg RPA)
  • The ability to manage (potentially) many different transaction-based levers (eg parallel project activities, field workforce allocations, etc) that contribute to rolled-up objectives / targets
 
You can see why I refer to this as a data bridge product right? I figure that it sits above all other data sources and provides the management bridge across them all. 
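To illustrate the difference in flavour (and not any real product), a “data bridge” style function might continually roll a transactional feed up against a stated objective rather than just reporting the transactions. The objective, dates and straight-line run-rate logic below are invented for the sketch.

```python
from datetime import date

# Objective: sign up 50,000 customers along the Southern corridor this month (illustrative only).
objective = {"target": 50_000, "start": date(2019, 9, 1), "end": date(2019, 9, 30)}

# Stand-in for a near-real-time activations feed from the OSS/BSS.
activations_to_date = 28_400
today = date(2019, 9, 20)

def objective_status(objective, actual, today):
    """Compare actual progress against the straight-line run-rate needed to hit the target."""
    total_days = (objective["end"] - objective["start"]).days + 1
    elapsed_days = (today - objective["start"]).days + 1
    expected = objective["target"] * elapsed_days / total_days
    return {"actual": actual,
            "expected_by_now": round(expected),
            "on_track": actual >= expected}

print(objective_status(objective, activations_to_date, today))
```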
 
PS. If you want to know the name of the existing products that fit into the “data bridge” category, please leave us a message.

An OSS checksum

Yesterday’s post discussed two waves of decisions stemming from our increasing obsession with data collection.

“…the first wave had [arisen] because we’d almost all prefer to make data-driven decisions (ie decisions based on “proof”) rather than “gut-feel” decisions.

We’re increasingly seeing a second wave come through – to use data not just to identify trends and guide our decisions, but to drive automated actions.”

Unfortunately, the second wave has an even greater need for data correctness / quality than we’ve experienced before.

The first wave allowed for human intervention after the collection of data. That meant human logic could be applied to any unexpected anomalies that appeared.

With the second wave, we don’t have that luxury. It’s all processed by the automation. Even learning algorithms struggle with “dirty data.” Therefore, the data needs to be perfect and the automation’s algorithm needs to flawlessly cope with all expected and unexpected data sets.

Our OSS have always had a dependence on data quality so we’ve responded with sophisticated ways of reconciling and maintaining data. But the human logic buffer afforded a “less than perfect” starting point, as long as we sought to get ever-closer to the “perfection” asymptote.

Does wave 2 require us to solve the problem from a fundamentally different starting point? We have to assume perfection akin to a checksum of correctness.
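By way of illustration only, the “checksum of correctness” idea might translate into a validation gate that an automation refuses to act without. The field names, checks and quarantine behaviour below are assumptions for the sketch, not a prescribed design.

```python
# Sketch of a data-quality gate in front of an automated action.
# If the incoming record fails any check, the automation stops and
# refers the record for human review instead of acting on dirty data.

REQUIRED_FIELDS = {"device_id", "metric", "value", "timestamp"}

def quality_gate(record):
    """Return (ok, reasons) rather than silently accepting imperfect data."""
    reasons = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        reasons.append(f"missing fields: {sorted(missing)}")
    if "value" in record and not isinstance(record["value"], (int, float)):
        reasons.append("value is not numeric")
    return (not reasons, reasons)

def automated_action(record):
    ok, reasons = quality_gate(record)
    if not ok:
        return f"QUARANTINED for human review: {reasons}"
    return f"ACTIONED: {record['device_id']} {record['metric']}={record['value']}"

print(automated_action({"device_id": "ne-1", "metric": "cpu", "value": 97,
                        "timestamp": "2019-09-20T10:00:00Z"}))
print(automated_action({"device_id": "ne-2", "metric": "cpu", "value": "high"}))
```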

Perfection isn’t something I’m very qualified at, so I’m open to hearing your ideas. 😉

 

Crossing the OSS chasm

Geoffrey Moore’s seminal book, “Crossing the Chasm,” described the psychological chasm between early buyers and the mainstream market.

Crossing the Chasm

Seth Godin cites Moore’s work, “Moore’s Crossing the Chasm helped marketers see that while innovation was the tool to reach the small group of early adopters and opinion leaders, it was insufficient to reach the masses. Because the masses don’t want something that’s new, they want something that works…

The lesson is simple:

– Early adopters are thrilled by the new. They seek innovation.

– Everyone else is wary of failure. They seek trust.”
 

I’d reason that almost all significant OSS buyer decisions fall into the “mainstream market” section in the diagram above.  Why? Well, an organisation might have the 15% of innovators / early-adopters conceptualising a new OSS project. However, sign-off of that project usually depends on a team of approvers / sponsors. Statistics suggest that 85% of the team is likely to exist in a mindset beyond the chasm and outweigh the 15%. 

The mainstream mindset is seeking something that works and something they can trust.

But OSS / digital transformation projects are hard to trust. They’re all complex and unique. They often fail to deliver on their promises. They’re rarely reliable or repeatable. They almost all require a leap of faith (and/or a burning platform) for the buyer’s team to proceed.

OSS sellers seek to differentiate from the 400+ other vendors (of course). How do they do this? Interestingly, mostly by pitching their innovations and uniqueness.

Do you see the gap here? The seller is pitching the left side of the chasm and the buyer cohort is on the right.

I wonder whether our infuriatingly lengthy sales cycles (often 12-18 months) could be reduced if only we could engineer our products and projects to be more mainstream, repeatable, reliable and trustworthy, whilst being less risky.

This is such a dilemma though. We desperately need to innovate, to take the industry beyond the chasm. Should we innovate by doing new stuff? Or should we do the old, important stuff in new and vastly improved ways? A bit of both??

Do we improve our products and transformations so that they can be used / performed by novices rather than designed for use by all the massive intellects that our industry seems to currently consist of?

 

 

 

 

Over 30 Autonomous Networking User Stories

The following is a set of user stories I’ve provided to TM Forum to help with their current Autonomous Networking initiative.

They’re just an initial discussion point for others to riff off. We’d love to get your comments, additions and recommended refinements too.

As a Head of Network Operations, I want to Automatically maintain the health of my network (within expected tolerances if necessary) So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Ensure the overall solution is designed with machine-led automations as a guiding principle So that Human intervention can not be easily engineered into the systems/processes
As a Head of Network Operations, I want to Automatically identify any failures of resources or services within the entire network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Automatically identify any degradation of resource or service performance within the network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Map each codified data set (for failure or degradation cases) to a remedial action plan So that Remedial activities can be initiated without human interaction
As a Head of Network Operations, I want to Identify which remedial activities can be initiated via a programmatic interface and which activities require manual involvement such as a truck roll So that Even manual activities can be automatically initiated
As a Head of Network Operations, I want to Ensure that automations are able to resolve all known failure / degradation scenarios So that Activities can be initiated for any failure or degradation and be automatically resolved through to closure (with little or no human intervention)
As a Head of Network Operations, I want to Ensure there is sufficient network resilience So that Any failure or degradation can be automatically bypassed (temporarily or permanently)
As a Head of Network Operations, I want to Ensure there is sufficient resilience within all support systems So that Any failure or degradation can be automatically bypassed (temporarily or permanently) to ensure customer service is maintained
As a Head of Network Operations, I want to Ensure that operator initiated changes (eg planned maintenance, software upgrades, etc) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure there is minimal or no impact to customer services, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Ensure that customer initiated changes (eg by raising an incident) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure the incident is closed expediently, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Initiate planned outages with or without triggering automated remedial activities So that The change agents can decide to use automations or not and ensure automations don’t adversely affect the activities that are scheduled for the planned outage window
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur, impacted customers are automatically notified (on first instance and via a communications sequence if necessary throughout the outage window) So that Customer experience can be managed as best possible
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur without a remedial action being triggered, a post-mortem analysis is initiated So that Automations can be revised to cope with this previously unhandled outage scenario
As a Head of Network Operations, I want to Ensure that even previously un-seen new fail scenarios can be handled by remedial automations So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Automatically monitor the effects of remedial actions So that Remedial automations don’t trigger race conditions that result in further degradation and/or downstream impacts
As a Head of Network Operations, I want to Be able to manually override any automations by following a documented sequence of events So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on a scheduled and/or randomised basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on an ad-hoc basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Perform scheduled compliance checks on the network So that Expected configurations and policies are in place across the network
As a Head of Network Operations, I want to Automatically generate scheduled reports relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Automatically generate dashboards (in near-real-time) relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Ensure that automations are able to extend across all domains within the solution So that Remedial actions aren’t constrained by system hand-offs
As a Head of Network Operations, I want to Ensure configuration backups are performed automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be stored as protection in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure configuration restores are performed and tested automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be reverted to in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure automations are able to manage the entire service lifecycle (add, modify/upgrade, suspend, restore, delete) So that Customer services can evolve to meet client expectations with little or no human intervention
As a Head of Network Operations, I want to Have a design and architecture that uses intent-based and/or policy-based actions So that The complexity of automations is minimised (eg automations don’t need to consider custom rules for different device makes/models, etc)
As a Head of Network Operations, I want to Ensure as many components of the solution (eg EMS, OSS, customer portals, etc) have programmatic interfaces (even if manual activities are required in back-end processes) So that Automations can initiate remedial actions in near real time
As a Head of Network Operations, I want to Ensure all components and data flows within the solution are securely hardened (eg encryption of data in motion and at rest) So that The power of the autonomous platform can not be leveraged for nefarious purposes
As a Head of Network Operations, I want to Ensure that all required metrics can be automatically sourced from the network / systems in as near real time as feasible / useful So that Automations have the full set of data they need to initiate remedial actions and it is as up-to-date as possible for precise decision-making
As a Head of Network Operations, I want to Use the power of learning machines So that The sophistication and speed of remedial response is faster, more accurate and more reliable than if manual interaction were used
As a Head of Network Operations, I want to Record actual event patterns and replay scenarios offline So that Event clusters and response patterns can be thoroughly tested as part of the certification process prior to being released into production environments
As a Head of Network Operations, I want to Capture metrics that can be cross-referenced against event patterns and remedial actions So that Regressions and/or refinements can improve existing automations (ie continuous retraining of the model)
As a Head of Network Operations, I want to Be able to seed a knowledge base with relevant event/action data, whether the pattern source is from Production, an offline environment, a digital twin environment or other production-like environments So that The database is able to identify real scenarios, even if scenarios are intentionally initiated, but could potentially cause network degradation to a production environment
As a Head of Network Operations, I want to Ensure that programmatic interfaces also allow for revert / rollback capabilities So that Remedial actions that aren’t beneficial can be rolled back to the previous state; OR other remedial actions are performed, allowing the automation to revert to original configuration / state
As a Head of Network Operations, I want to Be able to initiate circuit breakers to override any automations So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Manually or automatically generate response-plans (ie documented sequences of activities) for any remedial actions fed back into the system So that Internal (eg quality control) or external (eg regulatory) bodies can review “best-practice” remedial activities at any point in time
As a Head of Network Operations, I want to Intentionally trigger catastrophic network failures (in non-prod environments) So that We can trial many remedial actions until we find an optimal solution to seed the knowledge base with
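Purely as an illustrative companion to stories like “Map each codified data set to a remedial action plan” and “Be able to initiate circuit breakers,” a minimal codification might pair event signatures with remedial plans and a manual override. All signatures, plan steps and names below are invented for the sketch.

```python
# Illustrative mapping from codified failure signatures to remedial action plans.
remedial_plans = {
    ("LOS", "access-ring"):       ["switch traffic to protection path", "raise repair work order"],
    ("HIGH_TEMP", "core-router"): ["increase fan speed via EMS", "notify facilities team"],
}

circuit_breaker_open = False   # manual override: when True, no automation fires

def handle_event(signature):
    """Return the remedial plan for a known signature, or escalate unknown ones."""
    if circuit_breaker_open:
        return ["automation suspended by operator override"]
    plan = remedial_plans.get(signature)
    if plan is None:
        return ["no known plan: log, notify operators, trigger post-mortem / model retraining"]
    return plan

print(handle_event(("LOS", "access-ring")))
print(handle_event(("FLAP", "edge-switch")))
```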

We use time-stamping in OSS, but what about geo-stamping?

A slightly left-field thought dawned on me the other day and I’d like to hear your thoughts on it.

We all know that almost all telemetry coming out of our networks is time-stamped. Events, syslogs, metrics, etc. That makes perfect sense because we look for time-based ripple-out effects when trying to diagnose issues.

But does it therefore also make sense to geo-stamp telemetry data? Just as time-based ripple-out is common, so too are geographic / topological (eg nearest neighbour and/or power source) ripple-out effects.

If you want to present telemetry data as a geo/topo overlay, you currently have to enrich the telemetry data set first. Typically that means identifying the device name that’s generating the data and then doing a query on huge inventory databases to find the location and connectivity that corresponds to that device.

It’s usually not a complex query, but consider how much processing power must go into enriching at the enormous scale of telemetry records.

For stationary devices (eg core routers), it might seem a bit absurd adding a fixed geo-code (which has to be manually entered into the device once) to every telemetry record, but it seems computationally far more efficient than data lookups (please correct me if I’m wrong here!). For devices that move around (eg routers on planes), hopefully they already have GPS sensors to provide geo-stamp data.
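To make the comparison tangible, here’s a rough sketch of the two approaches side by side. The record formats, device names and coordinates are invented for illustration.

```python
# Option 1: enrich after the fact - every telemetry record triggers an inventory lookup.
inventory_locations = {"core-rtr-01": (-33.8688, 151.2093)}   # device -> (lat, lon)

def enrich_by_lookup(record):
    record["geo"] = inventory_locations.get(record["device"])
    return record

# Option 2: geo-stamp at source - the device embeds its (mostly static) coordinates
# in every record it emits, the same way it embeds a timestamp.
def emit_geo_stamped(device, lat, lon, metric, value, timestamp):
    return {"device": device, "geo": (lat, lon), "metric": metric,
            "value": value, "timestamp": timestamp}

print(enrich_by_lookup({"device": "core-rtr-01", "metric": "cpu", "value": 62,
                        "timestamp": "2019-09-20T10:00:00Z"}))
print(emit_geo_stamped("core-rtr-01", -33.8688, 151.2093, "cpu", 62, "2019-09-20T10:00:00Z"))
```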

What do you think? Am I stating a problem that has already been solved and/or is not worth solving? Or does it have merit?

As a network owner….

….I want to make my network so observable, reliable, predictable and repeatable that I don’t need anyone to operate it.

That’s clearly a highly ambitious goal. Probably even unachievable if we say it doesn’t need anyone to run it. But I wonder whether this has to be the starting point we take on behalf of our network operator customers?

If we look at most networks, OSS, BSS, NOC, SOC, etc (I’ll call this whole stack “the black box” in this article), they’ve been designed from the ground up to be human-driven. We’re now looking at ways to automate as many steps of operations as possible.

If we were to instead design the black-box to be machine-driven, how different would it look?

In fact, before we do that, perhaps we have to take two distinct perspectives on this question:

  1. Retro-fitting existing black-boxes to increase their autonomy
  2. Designing brand new autonomous black-boxes

I suspect our approaches / architectures will be vastly different.

The first will require an incredibly complex measure, command and control engine to sit over the top of the existing black box. It will probably also need to reach into many of the components that make up the black box and exert control over them. This approach has many similarities with what we already do in the OSS world. The only exception would be that we’d need to be a lot more “closed-loop” in our thinking. I should also re-iterate that this is incredibly complex because it inherits an existing “decision tree” of enormous complexity and adds further convolution.

The second approach holds a great deal more promise. However, it will require a vastly different approach on many levels:

  1. We have to take a chainsaw to the decision tree inside the black box. For example:
    • We start by removing as much variability from the network as possible. Think of this like other utilities such as water or power. Our electricity service only has one feed-type for almost all residential and business customers, yet it still allows us great flexibility in what we plug into it. What if a network operator were to simply offer a “broadband dial-tone” service and let end users decide what they overlay on that bit-stream?
    • This reduces the “protocol stack” in the network (think of this in terms of the long list of features / tick-boxes on any router’s brochure)
    • As well as reducing network complexity, it drastically reduces the variables an end-user needs to decide from. The operator no longer needs 50 grandfathered, legacy products 
    • This also reduces the decision tree in BSS-related functionality like billing, rating, charging, clearing-house
    • We achieve a (globally?) standardised network services catalog that’s completely independent of vendor offerings
    • We achieve a more standardised set of telemetry data coming from the network
    • In turn, this drives a more standardised and minimal set of service-impact and root-cause analyses
  2. We design data input/output methods and interfaces (to the black box and to any of its constituent components) with closed-loop immediacy in mind. At the moment we tend to have interfaces that allow us to interrogate the network and push changes into the network separately, rather than tasking the network to keep itself within expected operational thresholds
  3. We allow networks to self-regulate and self-heal, not just within a node, but between neighbours without necessarily having to revert to centralised control mechanisms like OSS
  4. All components within the black-box, down to device level, are programmable. [As an aside, we need to consider how to make the physical network more programmable or reconcilable, considering that cables, (most) patch panels, joints, etc don’t have APIs. That’s why the physical network tends to give us the biggest data quality challenges, which ripples out into our ability to automate networks]
  5. End-to-end data flows (ie controls) are to be near-real-time, not constrained by processing lags (eg 15 minute poll cycles, hourly log processing cycles, etc) 
  6. Data minimalism engineering (see the sketch after this list). It’s currently not uncommon for network devices to produce dozens, if not hundreds, of different metrics. Most are never used by operators manually, nor are they likely to be used by learning machines. This increases data processing, distribution and storage overheads. If we only produce what is useful, then it should improve data flow times (point 5 above). Therefore learning machines should be able to control which data sets they need from network devices and at what cadence. The learning engine can start off collecting all metrics, then progressively turn them off as it deems metrics unnecessary. This could also extend to controlling log-levels (ie how much granularity of data is generated for a particular log, event, performance counter)
  7. Perhaps we even offer AI-as-a-service, whereby any of the components within the black-box can call upon a centralised AI service (and the common data lake that underpins it) to assist with localised self-healing, self-regulation, etc. This facilitates closed-loop decisions throughout the stack rather than just an over-arching command and control mechanism
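Here’s the sketch referred to in point 6 above: an illustrative-only example of a learning loop that starts by collecting every metric a device offers, then progressively switches off the ones its models never find useful. The metric names, usefulness scores and threshold are all assumptions.

```python
# Illustrative-only sketch of "data minimalism": prune metric collection based on
# feedback from the learning engine (names and thresholds are invented).
collection_plan = {
    "cpu_util":        {"enabled": True, "cadence_s": 60},
    "fan_speed":       {"enabled": True, "cadence_s": 60},
    "obscure_counter": {"enabled": True, "cadence_s": 60},
}

# Stand-in for feedback from the learning engine: how often each metric actually
# contributed to a decision (eg feature importance, usage in remedial actions).
usefulness = {"cpu_util": 0.82, "fan_speed": 0.31, "obscure_counter": 0.0}

def prune_collection(plan, usefulness, threshold=0.05):
    """Turn off collection of metrics the learning engine deems unnecessary."""
    for metric, score in usefulness.items():
        if score < threshold:
            plan[metric]["enabled"] = False
    return plan

print(prune_collection(collection_plan, usefulness))
```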

I’m barely exposing the tip of the iceberg here. I’d love to get your thoughts on what else it will take to bring fully autonomous networks to reality.

For those starting out in OSS product, here’s a tip

“For those starting out in product, here’s a tip: Design, Defaults*, Documentation, Details and Delivery really matter in software.”
Jeetu Patel here.

* Note that you can interpret “Defaults” to be Out-Of-The-Box functionality offered by the product.

Let’s break those 5 D-words down and describe why they really matter to the OSS industry shall we?

  • Design – The power of OSS product development tends to lie with engineering, ie the developers. I have huge admiration for the very clever and very talented engineers who create amazing products for us to use, buuutttttt……. I just have one reservation – is there a single OSS company that is design-driven? A single one that’s making intuitive, effective, beautiful experiences for their users? The obvious answer, of course, is that engineering teams hold sway over design teams in OSS – how many OSS vendors even have a dedicated design department??? See this article for more.
  • Defaults – Almost every OSS I know of has an enormous amount of “out-of-the-box” functionality baked in. You could even say that most have too much functionality. There’s functionality that might be really important for one customer but never even used by any of the vendor’s other customers. It just represents bloat for all the other customers, and potentially a distraction for their operators. I’m still bemused to see vendors trying to differentiate by adding obscure new default features rather than optimising for “must-have” functions. See this article for more. However, I must add that I’m starting to see a shift in some OSS. They’re moving away from having baked-in functionality and are moving to more data-repository-driven architectures. Interesting!!
  • Documentation – This is a really interesting factor! Some vendors make almost no documentation available until a prospect becomes a paying customer. Other vendors make their documentation available for the general public online and dedicate significant effort to maintaining their information library. The low-doc approach espoused by Agile could be argued to be reducing document quality. However, it also reduces the chance of producing documentation that nobody will read ever! Personally, I believe vendors like Cisco have earnt a huge competitive advantage (in the networking space moreso than OSS) because of their training / certification (ie CCNA, etc) and self-learning (ie online documentation). See this article for more. As such, I’d tend to err on over-documenting for customer-facing collateral. And perhaps under-documenting for internal-facing collateral unless it’s likely to be used regularly and by many.
  • Details – This is another item where there are two ends to the spectrum. That might surprise some people who would claim that attention to detail is paramount. Well, yes…. in many cases, but certainly not all on OSS projects. Let me share a story on attention to detail on a past OSS project. And another story on seeking perfection. Sometimes we just need to find the right balance, and knowing when to prioritise resilience and when to favour precision becomes an art.
  • Delivery – I have two perspectives on this D-word. Firstly, the Steve Jobs inspired quote of “Real artists ship!” In other words, to laud the skill of shipping a product that provides value to the customer rather than holding off on a not-yet-perfected solution. But the second case is probably more important. OSS projects tend to be massive and complex transformation efforts. Our OSS are rarely self-installed like office software, so they require big delivery teams. Some products are easy to deliver/deploy. Others are a *&$%#! If you’re a product developer, please get out in the trenches with your delivery teams and find ways to make their job easier and/or more repeatable.

Net Simplicity Score (NSS) gets a little more complex

In last Tuesday’s post, I asked the community here on PAOSS and on TM Forum’s Engage platform for ideas about how you would benchmark complexity.

I also provided a reference to an old post that described the concept of an NSS (Net Simplicity Score) for our OSS/BSS.

Due to the complexity of factors that contribute to a complexity score, the NSS is a “catch-all” simplicity metric. Hopefully it will allow subtraction projects to be easily justified, just as the NPS (Net Promoter Score) metric has helped justify customer experience initiatives.

The NSS (Net Simplicity Score) could be further broken down into:

  • The NCSS (Net Customer Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) of how easy it is to choose and use the company / product / service. This is an external metric (ie the ranking of the level of difficulty that your customers face)
  • The NOSS (Net Operator Simplicity Score) – A ranking from 0 (lowest) to 10 (highest) of how easy it is to choose and use the company / product / service. This is an internal metric (ie for operators to rank complexity of systems and their constituent applications / data / processes)

One interesting item of feedback came from Ronald Hasenberger. He rightly pointed out that just because something is simple for users to interact with, doesn’t mean it’s simple behind the scenes – often exactly the opposite. The iPod example I used in earlier posts is a case in point. The iPod was more intuitive than existing MP3 players, but a huge amount of design and engineering went into making it that way. The underlying “system” certainly wasn’t simple.

So perhaps there’s a third simplicity factor to add to the two bullets listed above:

  • The NSSS (Net System Simplicity Score) – and this one does require a more sophisticated algorithm than just an aggregate of perceptions. Not only that, but it’s the one that truly reflects the systems we design and build. I wonder whether the first two are an initial set of proxies that help drive complexity out of our solutions, and whether we need to develop Ronald’s third one to make the biggest impact?
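Since none of these posts pin down a formula, here’s one hypothetical way the first two scores could be calculated, borrowing NPS-style banding (9-10 counted as “simple,” 0-6 as “complex”) purely as an assumption. The rankings below are made-up sample data.

```python
# Hypothetical NPS-style calculation for a Net Simplicity Score.
# Each respondent ranks "how easy is it to choose and use?" from 0-10.
def net_simplicity_score(rankings):
    simple = sum(1 for r in rankings if r >= 9)      # assumed "simple" band
    complex_ = sum(1 for r in rankings if r <= 6)    # assumed "complex" band
    return 100 * (simple - complex_) / len(rankings)

customer_rankings = [9, 10, 7, 4, 8, 9, 3, 10]   # NCSS inputs (external)
operator_rankings = [5, 6, 7, 4, 8, 3, 6, 5]     # NOSS inputs (internal)

print("NCSS:", net_simplicity_score(customer_rankings))
print("NOSS:", net_simplicity_score(operator_rankings))
```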

Again, I’d love to hear your thoughts!

Opinions wanted – How to Benchmark OSS/BSS complexity

I’d love to ask you an important question…  how do we benchmark OSS/BSS complexity? To measure how complex our systems are and therefore provide a signpost for simplification.

A colleague has opined that the number of apps in a stack could be used as a proxy. I can see where he’s going with that, but I feel that it doesn’t account for architectural differences such as monolith versus microservices.

I’d love to hear your thoughts via the comments box below.

FWIW, here are some additional thoughts from me, but please don’t let them bias your opinions:

  • For me, complexity relates to the efficiency of getting tasks done (see the sketch after this list):
    • How much time to complete certain tasks
    • How many button clicks
    • How much swivel-chairing
    • How many CPU cycles for automated tasks
    • How much admin overhead
    • How much duplicated effort and/or rework
  • However, there are so many different tasks done within an OSS/BSS stack that it’s difficult to provide a complexity metric that compares one OSS/BSS stack with another. Or compares a single stack before/after changes are made
  • In some cases the complexity happens inside the OSS/BSS “black box” (eg tools within the suite aren’t seamlessly integrated, causing operators to perform dual-entry that leads to data inconsistency and downstream re-work)
  • In other cases the complexity is inherited from outside the black box (eg product offerings have hundreds of possible variants that are imperceptibly different in the customer’s eyes). I call this The OSS Pyramid of Pain
  • In many cases, the complexity of an OSS/BSS stack is less about the systems and integrations, and more about the complexity of The Decision Tree that spans the stack. The spread of the Decision Tree is impacted by:
    • The OSS/BSS applications
    • Support applications (eg authentication, security, data management, resilience / availability, etc)
    • System interfaces (internal and external)
    • User interfaces
    • Process designs
    • Product definitions
    • Work practices
    • Data models
    • Design rules
    • Network topologies
    • etc, etc
  • The more complex the Decision Tree, the more complex it is to transform our OSS. It loosely aligns with what I call The Chessboard Analogy
  • The development strategy used also has an impact, be it monolithic, best-of-breed, hosted or in-house developed. For example, an in-house-developed solution is likely to have less functionality-bloat than a COTS (off-the-shelf) solution. The COTS solution needs to include additional functionality to enable it to support the requirements of multiple customers
  • And finally, a benchmark is only as useful as the actions that it triggers. How do we codify a complexity metric that has the equally complex array of contributions described above?
  • Perhaps we could take a somewhat abstracted approach like the NPS (Net Promoter Score) does, thus creating an NSS (Net Simplicity Score)
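As a purely illustrative starting point, a benchmark might time a fixed basket of representative tasks and count the friction measures listed above, then roll them into a single index. The task names, observations and weightings below are invented, and the weightings in particular would need real debate.

```python
# Sketch of benchmarking a fixed basket of representative OSS/BSS tasks.
# The same basket is re-run before and after a change (or across two stacks).
tasks = [
    {"task": "activate broadband service", "minutes": 18, "clicks": 42, "swivel_chairs": 3, "rework": 1},
    {"task": "correlate and clear alarm",  "minutes": 6,  "clicks": 15, "swivel_chairs": 1, "rework": 0},
    {"task": "design fibre path",          "minutes": 35, "clicks": 88, "swivel_chairs": 2, "rework": 2},
]

def complexity_index(tasks):
    """Crude aggregate: average friction per task (weights are arbitrary placeholders)."""
    score = 0.0
    for t in tasks:
        score += t["minutes"] + 0.1 * t["clicks"] + 5 * t["swivel_chairs"] + 10 * t["rework"]
    return round(score / len(tasks), 1)

print("Complexity index:", complexity_index(tasks))   # lower = simpler
```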

As mentioned above, I’d love to hear your thoughts on how we can benchmark the level of complexity in our OSS/BSS. Please leave your comments below.

 

The digital transformation paradox twins

There’s an old adage that “the confused mind always says no.”

Consider this from your own perspective. If you’re in a state of confusion about something, are you likely to commit wholeheartedly or will you look to delay / procrastinate?

The paradox for digital transformation is that our projects are almost always complex, but complexity breeds confusion and uncertainty. Transformation may be urgently needed, but it’s really hard to persuade stakeholders and sponsors to commit to change if they don’t have a clear picture of the way forward.

As change agents, we face another paradox. It’s our task to simplify the messaging, but our messaging should not imply that the project will be simple. That will just set unrealistic expectations for our stakeholders (“but this project was supposed to be simple,” they say).

Like all paradoxes, there’s no perfect solution. However, one technique that I’ve found to be useful is to narrow down the choices. Not by discarding them outright, but by figuring out filters – ways to quickly include or exclude branches of the decision tree.

Let’s take the example of OSS vendor selection. An organisation asks itself, “what is the best-fit OSS/BSS for our needs?” The Blue Book OSS/BSS Vendor Directory will show that there are well over 400 OSS/BSS providers to choose from. Confusion!

So let’s figure out what our needs are. We could dive into really detailed requirement gathering, but that in itself requires many complex decisions. What if we instead just use a few broad needs as our first line of filtering? We know we need an outside plant management tool. Our list of 400+ now becomes 20. There’s still confusion, but we’re now more targeted.

But 20 is still a lot to choose from. A slightly deeper level of filtering should allow us to get to a short list of 3-5. The next step is to test those 3-5 to see which does the best at fulfilling the most important needs of the organisation. Chances are that the best-fit won’t fulfil every requirement, but generally it will clearly fulfil more than any of the other alternatives. It’s best-fit, not perfect fit.
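Purely to illustrate the filtering idea (the vendor names and capability tags below are made up), the progressive narrowing might look like this:

```python
# Progressive filtering of a long vendor list down to a ranked short list.
vendors = [
    {"name": "Vendor A", "capabilities": {"outside plant", "workforce mgmt"}},
    {"name": "Vendor B", "capabilities": {"fault mgmt", "performance mgmt"}},
    {"name": "Vendor C", "capabilities": {"outside plant", "inventory"}},
]

broad_needs = {"outside plant"}                    # first, coarse filter
important_needs = {"outside plant", "inventory"}   # then score the survivors

short_list = [v for v in vendors if broad_needs <= v["capabilities"]]
ranked = sorted(short_list,
                key=lambda v: len(important_needs & v["capabilities"]),
                reverse=True)

for v in ranked:
    print(v["name"], "matches", len(important_needs & v["capabilities"]), "of", len(important_needs))
```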

We haven’t made the project less complex, but we have simplified the decision. We’ve arrived at the “best” option, so the way forward should be clear right?

Unfortunately, it’s not always that easy. Even though the best way forward has been identified, there are still uncertainties in the minds of stakeholders, caused purely by the complexity of the upcoming project. I’ve seen examples where the choice of vendor has been clear, with the best-fit clearly surpassing the next-best, but the buyer is still indecisive. I completely get it. Our task as change agents is to reduce doubts and increase transformation confidence.

What will get your CEO fired? (part 4)

In Monday’s article, we suggested that the three technical factors that could get the big boss fired are probably limited to:

  1. Repeated and/or catastrophic failure (of network, systems, etc)
  2. Inability to serve the market (eg offerings, capacity, etc)
  3. Inability to operate network assets profitably

In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it. In yesterday’s article we looked at market-serving factors for us to investigate and monitor.

But let’s look at point 3 today. The profitability factors we could consider that reduce the chances of the big boss getting fired are:

  1. Ability to see revenues in near-real-time (revenues are relatively easy to collect, so we use these numbers a lot. Much harder are profitability measures because of the shared allocation of fixed costs)

  2. Ability to see cost breakdown (particularly which parts of the technical solution are most costly, such as what device types / topologies are failing most often)

  3. Ability to measure profitability by product type, customer, etc

  4. Are there more profitable or cost-effective solutions available?

  5. Is there greater profitability that could be unlocked by simplification?

Exactly what is an OSS’s “intuition age”?

I’m currently reading a book entitled, “Jony Ive. The genius behind Apple’s greatest products.”

I’d like to share a paragraph with you from it (and probably expect a few more in coming days):

“…Apple’s internal culture heavily favored the engineers within the product groups. The design process was engineering driven. In the early days of Frog Design, the engineers had bent over backward to help implement the design team’s ambitions, but now the power had shifted. The different engineering groups gave their products in development to Brunner’s group, who were expected to merely “skin” them.

Brunner wanted to shift the power from engineering to design. He started thinking strategically… The idea was to get ahead of the engineering groups and start to make Apple more of a design-driven company rather than a marketing or engineering one.”

That’s an unbelievably insightful conclusion Robert Brunner made. If he wanted to turn Apple into a design-driven company, then he’d have to prepare design concepts that looked further into the future than where the engineers were up to. Products like the iPod and iPad are testimony that Brunner’s strategy worked.

We face the same situation in OSS today. The power of product development tends to lie with engineering, ie the developers. I have huge admiration for the very clever and very talented engineers who create amazing products for us to use, buuutttttt…….

I just have one reservation – is there a single OSS company that is design-driven? A single one that’s making intuitive, effective, beautiful experiences for their users? Of course engineering holds power over design in OSS – how many OSS vendors even have a dedicated design department???

Let me give a comparison (albeit a slightly unfair one). Both of my children were reasonably adept at navigating their way around our iPad (for multiple use cases) by the age of three. What would the equivalent “intuition age” be for navigating our OSS?

If you’re a product manager, have you ever tried it? Have you ever considered benchmarking it (or an equivalent usability metric) and seeing what you could do to improve it for your OSS products?

Interesting metrics from The Blue Book launch

When I first started the Passionate About OSS site / blog many years ago, I was lucky to get a handful of views per day. It’s grown by many multiples since then, fortunately.

The launch of The Blue Book OSS/BSS Vendor Directory generated some exciting metrics yesterday. The directory alone came within 5 pageviews of the highest count we’ve ever seen on PassionateAboutOSS.com (and PAOSS is up to nearly 2,500 posts now). That total appeared in only a 14-hour window because we didn’t go live with The Directory or metric collection until ~10am local time! The graphs are indicating that we should easily exceed PAOSS’s best ever count today.

If you were one of the many viewers who popped in from all around the world to look at The Directory, thank you! If you have any suggested improvements, we’d love to hear from you as we’re sure to be making many further tweaks in coming days/months.

But the most interesting fact about the launch yesterday was that a job posting appeared on UpWork to scrape all the data we’ve presented. On our very first day!! In fact a gentleman in the US reviewed bids and awarded the UpWork job all within about 14 hours of go-live.

That’s positive news because it means that at least one person must’ve thought the data was useful. 🙂