I’m currently reading a book entitled, “Jony Ive. The genius behind Apple’s greatest products.”
I’d like to share a paragraph with you from it (and you can probably expect a few more in the coming days):
“…Apple’s internal culture heavily favored the engineers within the product groups. The design process was engineering driven. In the early days of Frog Design, the engineers had bent over backward to help implement the design team’s ambitions, but now the power had shifted. The different engineering groups gave their products in development to Brunner’s group, who were expected to merely “skin” them.
Brunner wanted to shift the power from engineering to design. He started thinking strategically… The idea was to get ahead of the engineering groups and start to make Apple more of a design-driven company rather than a marketing or engineering one.”
That’s an unbelievably insightful conclusion Robert Brunner made. If he wanted to turn Apple into a design-driven company, then he’d have to prepare design concepts that looked further into the future than where the engineers were up to. Products like the iPod and iPad are testimony that Brunner’s strategy worked.
We face the same situation in OSS today. The power of product development tends to lie with engineering, ie the developers. I have huge admiration for the very clever and very talented engineers who create amazing products for us to use, buuutttttt…….
I just have one reservation – is there a single OSS company that is design-driven? A single one that’s making intuitive, effective, beautiful experiences for their users? Of course engineering holds power over design in OSS – how many OSS vendors even have a dedicated design department???
Let me give a comparison (albeit a slightly unfair one). Both of my children were reasonably adept at navigating their way around our iPad (for multiple use cases) by the age of three. What would the equivalent “intuition age” be for navigating our OSS?
If you’re a product manager, have you ever tried it? Have you ever considered benchmarking it (or an equivalent usability metric) and seeing what you could do to improve it for your OSS products?
We’re going to look into assurance models of the past versus the changing assurance demands that are appearing these days. The diagrams below are highly stylised for discussion purposes so they’re unlikely to reflect actual implementations, but we’ll get to that.
Under the old model, the heart of the OSS/BSS was the database (almost exclusively a relational database). It would gather data, via probes/MDDs/collectors, from the network under management (Note: I’ve shown the sources as devices, but they could equally be EMS/NMS). The mediation device drivers (MDDs) would take feeds from the network and homogenise them to be suitable for very precise loading into tables in the relational databases.
This data came in the form of alarms/events, performance counters, syslogs and not much else. It could come in all sorts of common (eg SNMP) or obscure forms / protocols. Some would come via near-real-time notifications, but a lot was polled at cycles such as 5 or 15 mins.
Then the OSS/BSS applications (eg Assurance, Inventory, etc) would consume data from the database and write other data back to the database. There were some automations, such as hard-coded suppression rules, etc.
The automations were rarely closed-loop (ie to actually resolve the assurance challenge). There were also software assistants such as trendlines and threshold alerts to help capacity planners.
There was little overlap into security assurance – occasionally there might have even been a notification of device configuration varying from a golden config or indirect indicators through performance graphs / thresholds.
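To make the old model a little more concrete, here is a minimal sketch of the mediation step. It assumes two invented raw formats (an SNMP-style dictionary and a pipe-delimited syslog line), a hypothetical three-column alarm table and a single hard-coded suppression rule. None of these names come from a real product; they only illustrate how feeds were homogenised before being loaded into relational tables.

```python
import sqlite3

# Hypothetical raw records from two different sources (formats are invented).
RAW_FEEDS = [
    {"source": "snmp", "agent": "router-1", "trap": "linkDown", "sev": 5},
    {"source": "syslog", "line": "switch-7|PORT_FLAP|3"},
    {"source": "snmp", "agent": "router-1", "trap": "heartbeat", "sev": 1},
]

# Hard-coded suppression rule: event types that are never shown to operators.
SUPPRESSED_EVENT_TYPES = {"heartbeat"}


def mediate(raw):
    """Homogenise one raw record into a (device, event_type, severity) row."""
    if raw["source"] == "snmp":
        return (raw["agent"], raw["trap"], int(raw["sev"]))
    if raw["source"] == "syslog":
        device, event_type, severity = raw["line"].split("|")
        return (device, event_type, int(severity))
    raise ValueError(f"no mediation driver for source {raw['source']!r}")


def load_alarms(raw_feeds):
    """Load mediated, non-suppressed rows into an in-memory relational table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE alarm (device TEXT, event_type TEXT, severity INTEGER)")
    rows = [mediate(raw) for raw in raw_feeds]
    rows = [row for row in rows if row[1] not in SUPPRESSED_EVENT_TYPES]
    db.executemany("INSERT INTO alarm VALUES (?, ?, ?)", rows)
    return db


db = load_alarms(RAW_FEEDS)
alarms = db.execute(
    "SELECT device, event_type, severity FROM alarm ORDER BY device"
).fetchall()
print(alarms)
```

The rigid row shape is the point: anything that doesn’t fit the table has to be forced into it by a driver or dropped.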
But so many aspects of the old world have been changing within our networks and IT systems. The active network, the resilience mechanisms, the level of virtualisation, the release management methods, containerisation, microservices, etc. The list goes on and the demands have become more complex, but also far more dynamic.
Let’s start with the data sources this time, because this impacts our choice of data storage mechanism. We still receive data from the active network devices (and EMS/NMS), but we now also draw data from other sources. They might be internal sources from IT, security, etc, but could also be external sources like social indicators. The 4 Vs of data between old and new models have fundamentally changed:
Volume – we’re seeing far more data
Variety – the sources are increasing and the structure of data is no longer as homogenised as it once was (in fact unstructured data is now commonplace)
Velocity – we’re receiving incoming data at any number of different velocities, often at far higher frequency than the 15 minute poll cycles of the past
Veracity (or trustworthiness) – our systems of old relied on highly dependable data due to its relational nature and could easily become a data death spiral if data quality deteriorated. Now we accept data with questionable integrity and need to work around it
Again the data storage mechanism is at the heart of the solution. In this case it’s a (probably) unstructured data lake rather than a relational database because of the 4 Vs above. The data that it stores must still be stored in a way that allows cross-referencing to happen with other data sets (ie the role of the indexer), but not as homogenised as a relational database.
The 4 Vs also fundamentally change the way we have to make use of the data. It surpasses our ability to process in a manual or semi-manual way (where semi-manual implies the traditional rules-based automations like suppression, root-cause analysis, etc). We have no choice but to increase dependency on machine-driven tools as automations need to become:
More closed-loop in nature – that is, to not just consolidate and create a ticket, but also to automate the resolution and ticket closure
More abundant – doing even more of the mundane, recurring tasks like auto-sizing resources (particularly virtual environments), restarting services, scheduling services, log clean-up, etc
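As a rough sketch of what “closed-loop” means in practice, the snippet below detects a fault, attempts a remediation, verifies the result and only then closes the ticket, escalating to a human if the fix doesn’t take. The `Ticket` class, the `check_health` and `restart` callables and the service name are all hypothetical stand-ins rather than any particular tool’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    service: str
    status: str = "open"
    notes: list = field(default_factory=list)


def closed_loop(service, check_health, restart, max_attempts=2):
    """Detect, remediate, verify and close; escalate if remediation fails."""
    if check_health(service):
        return None  # healthy, so no ticket is raised
    ticket = Ticket(service)
    for attempt in range(1, max_attempts + 1):
        restart(service)
        ticket.notes.append(f"restart attempt {attempt}")
        if check_health(service):
            ticket.status = "closed"  # loop closed without human involvement
            return ticket
    ticket.status = "escalated"  # hand over to a human operator
    return ticket


# Simulated environment: the service recovers after one restart.
state = {"video-cache": "down"}


def check_health(service):
    return state[service] == "up"


def restart(service):
    state[service] = "up"


ticket = closed_loop("video-cache", check_health, restart)
print(ticket)
```

The traditional, open-loop equivalent would stop after creating the ticket.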
To be honest, we probably passed the manual/semi-manual tipping point many years ago. In the meantime we’ve done the best we could, eagerly waiting until machine tools like ML (Machine Learning) and AI (Artificial Intelligence) could catch up and help out. This is still fairly nascent, but AIOps tools are becoming increasingly prevalent.
The exciting thing is that once we start harnessing the potential of these machine tools, our AIOps should allow us to ask far more than just network health questions like past models. They could allow us to ask marketing / cost / capacity / profitability / security questions like:
Where do I urgently need to increase capacity in the network (and can you automatically just make this happen – a more “just in time” provisioning of resources rather than planning months ahead)
Where could I re-position capacity around the network to reduce costs, improve performance, improve customer experience, meet unmet demand
Where should sales/marketing teams focus their efforts to best service unmet demand (could be based on current network or in sequence with network build-out that’s due to become ready-for-service)
Where are the areas of un-met demand compared with our current network footprint
With an available budget of $x, what ratio of maintenance, replacement and expansion is it best spent on, and where
How do we better understand profitability vectors in the network compared to just the more crude revenue metrics (note that profitability vectors could include service density, amount of maintenance on the supporting infrastructure, customer interactions, churn, etc on a geographic or similar basis)
Where (and how) can we progressively automate a more intent or policy-driven auto-remediation of the network (if we don’t already have a consistent approach to config management)
What policies can we tweak to get better performance from the network on a more real-time basis (eg tweaking QoS policies based on current traffic in various parts of the network)
Can we look at past maintenance trends to automatically set routine maintenance schedules that are customised by device, region, device type, loads, etc rather than using a one-size-fits-all maintenance schedule
Can I determine, on a real-time basis, what services are using which resources to get a true service impact estimate in a dynamic, packet-switched network environment
What configurations (or misconfigurations) in the network pose security vulnerability threats
If a configuration change is identified, can it be automatically audited and reported on (and perhaps even quarantined) before then being authorised (manually or automatically?)
What anomalies are we seeing that could represent security events
Can we utilise end-to-end constructs such as network services, customer services, product lifecycle, device lifecycle, application performance (as well as the traditional network performance) to enhance context and correlation
And so many more that can’t be as easily accommodated by traditional assurance tools
Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer to form a service-level perspective.
Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:
We need the concept of a span, like OpenTelemetry has between microservices (in a way, it’s like the nearest neighbour that each packet is being pushed to).
Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods. Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc
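To show what chaining spans might look like, here is a minimal sketch. It does not use the OpenTelemetry API. The `Span` record, its fields and the sample numbers are assumptions of mine, standing in for whatever a per-node telemetry beacon might aggregate for each service over one window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    service_id: str
    from_node: str
    to_node: str
    latency_ms: float
    packets_sent: int
    packets_lost: int


def chain_spans(spans, service_id, start_node):
    """Order one service's spans hop by hop, starting at start_node."""
    by_origin = {s.from_node: s for s in spans if s.service_id == service_id}
    chain, node = [], start_node
    while node in by_origin:
        span = by_origin.pop(node)  # pop so a routing loop cannot spin forever
        chain.append(span)
        node = span.to_node
    return chain


def end_to_end(chain):
    """Summarise a chain: latencies add up, delivery ratios multiply."""
    if not chain:
        return None
    delivered = 1.0
    for s in chain:
        delivered *= 1 - s.packets_lost / s.packets_sent
    return {
        "path": [chain[0].from_node] + [s.to_node for s in chain],
        "latency_ms": sum(s.latency_ms for s in chain),
        "loss_ratio": 1 - delivered,
    }


# Spans arrive unordered and interleaved with other services.
spans = [
    Span("svc-42", "B", "C", 5.5, 1000, 10),
    Span("svc-42", "A", "B", 2.0, 1000, 0),
    Span("svc-99", "A", "D", 7.0, 500, 5),
    Span("svc-42", "C", "Z", 1.5, 990, 0),
]
summary = end_to_end(chain_spans(spans, "svc-42", "A"))
print(summary)
```

This sketch assumes a single path per service. A real packet-switched network with load-sharing would need a span key that allows more than one next hop per node.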
As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!
“There’s the famous quote that if you want to understand how animals live, you don’t go to the zoo, you go to the jungle. The Future Lab has really pioneered that within Lego, and it hasn’t been a theoretical exercise. It’s been a real design-thinking approach to innovation, which we’ve learned an awful lot from.”
Jorgen Vig Knudstorp.
This quote prompted me to ask the question – how many times during OSS implementations had I sought to understand user behaviour at the zoo versus the jungle?
By that, how many times had I simply spoken with the user’s representative on the project team rather than directly with end users? What about the less obvious personas as discussed in this earlier post about user personas? Had I visited the jungles where internal stakeholders like project sponsors, executives, data consumers, etc. or external stakeholders such as end-customers, regulatory bodies, etc go about their daily lives?
I can truthfully, but regretfully, say I’ve spent far more time at OSS zoos than in jungles. This is something I need to redress.
But at least I can claim to have spent most of my time in customer-facing roles.
Too many of the product development teams I’ve worked closely with don’t even visit OSS zoos let alone jungles in any given year. They never get close to observing real customers in their native environments.
I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because just as you have the concept of traces / metrics / logs for apps, you similarly have traces / metrics / logs for networks.
In the network world, we’re good at getting metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains) as described earlier in this blog series. And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering logs / events, which seems a bit backward to me.
Take a closer look at the OpenTelemetry link above, which provides an overview of how their team is going to gather application telemetry. With increasing software-ification of our networks (eg SDN / NFV) and the use of microservices / NaaS / APIs in our management stacks, could this actually be our path to the holy grail of service assurance (ie capturing trace data – network service telemetry)?? Is it data plane? Is it control / management plane? Is it something in between?
Note: The “active measurements” approach described in part 3 is slightly compromised in current form, which is why I’m so intrigued by the potential of extending the concepts of OpenTelemetry into our software / virtual networks.
I’d really love your take on this one because I’m sure there are many elements to this that I haven’t thought through yet. Please leave your thoughts on the viability of the approach.
Below are three insightful tables from the Netrounds white paper:
Table 1 looks at the typical components (systems) that service assurance is comprised of. But more interestingly, it looks at the types of questions / challenges each traditional system is designed to resolve. You’ll have noticed that none of them directly answer any service quality questions (except perhaps inventory systems, which can be prone to having sketchy associations between services and the resources they utilise).
Table 2 takes a more data-centric approach. This becomes important when we look at the big picture here – ensuring reliable and effective delivery of customer services. Infrastructure failures are a fact of life, so improved service assurance models of the future will depend on automated and predictive methods… which rely on algorithms that need data. Again, we notice an absence of service-related data sets here (apart from Inventory again). You can see the constraints of the traditional data collection approach can’t you?
Table 3 instead looks at the goals of an ideal service-centric assurance solution. The traditional systems / data are convenient but clearly don’t align well to those goals. They’re constrained by what has been presented in tables 1 and 2. Even the highly touted panaceas of AI and ML are likely to struggle against those constraints.
What if we instead start with Table 3’s assurance of customer services in mind and work our way back? Or even more precisely, what if we start with an objective of perfect availability and performance of every customer service?
That might imply self-healing (automated resolution) and resolution prior to failure (prediction) as well as resilience. But let’s first give our algorithms (and dare I say it, AI/ML techniques) a better chance of success.
Working back – What must the data look like? What should the systems look like? What questions should these new systems be answering?
I just came across an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.” You can find a copy here. It’s well worth a read, so much so that I’ll unpack a few of the concepts it contains in a series of articles this week.
It rightly points out that, “Alarms and fault management are what most people think of when hearing the term service assurance. Classical service assurance systems do fall into this category, as they collect indicators from network devices (such as traps, syslog messages and telemetry data) and try to pinpoint faulty devices and interfaces that need fixing.”
This takes us into the rabbit-hole of what exactly is a service (a rabbit-hole that this article partly covers). But let’s put that aside for a moment and consider a service as being an end-to-end “thing” that a customer uses (and pays for, and therefore assumes will behave as “they” expect).
To borrow again from Netrounds, “… we must be able to measure and report on service KPIs in order to accurately measure network service quality from the end user, or customer, perspective. The KPIs should correspond to the service that the customer is paying for. For example, internet access services should measure network KPIs like loss, latency, jitter, and DNS and HTTP response times; a storage backup service should measure data throughput rate; IPTV should measure video frame loss, video buffer underrun events and channel zapping time; and VoIP should measure Mean Opinion Score (MOS).”
There’s just one problem with traditional assurance measuring techniques (eg traps, syslog messages). They are only an indirect proxy for the customer’s experience (and expectations) with the service they’re paying for. Traditional techniques just report on the links in the chain rather than the integrity of the entire length of chain. We have to look at each broken link and attempt to determine whether the chain’s integrity is actually impaired (considering the “meshing” that protects modern service chains). And if there is impairment, to then determine whose chain is impacted, in what way, and what priority needs to be given to its repair.
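The link-versus-chain reasoning can be sketched as a simple reachability check: given the links a service could use and the links currently reported as failed, is there still any working path end to end? The topology and node names below are invented. The check only answers “is the chain broken?”, not whether performance on the surviving path is degraded.

```python
from collections import deque


def service_is_impaired(links, failed_links, src, dst):
    """Return True if no working path remains between src and dst."""
    failed = {frozenset(link) for link in failed_links}
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue  # skip links that are currently down
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Breadth-first search across the links that are still up.
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return False
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical meshed service chain: two diverse paths between A and D.
LINKS = [("CPE", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "Server")]

protected = service_is_impaired(LINKS, [("B", "A")], "CPE", "Server")
double_fault = service_is_impaired(LINKS, [("A", "B"), ("C", "D")], "CPE", "Server")
access_fault = service_is_impaired(LINKS, [("CPE", "A")], "CPE", "Server")
print(protected, double_fault, access_fault)
```

A single alarm on the A-B link looks identical in all three cases from a device perspective, which is exactly why the link view is only a proxy for the customer’s experience.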
If we’re being completely honest, the customer doesn’t care about the chain links, or even their MOS score, only that they couldn’t understand what the person at the other end of the VoIP line was trying to communicate with them.
Exacerbating this further, increasing dependency on cloud and virtualised resources means that there are more chain links that fall outside our domain of visibility.
So, this thing that we’ve called service assurance for the last few decades might actually be a misnomer. We’ve definitely been monitoring the health of network devices and infrastructure (the links), but we tend to only be able to manage services (the chain) through reverse-engineering – by inference, brute force and wizardry.
Is there another way? Let’s dig further in tomorrow’s post.
This is the third part of a series describing a really exciting analysis I’ve just finished.
Part 1 described how we can turn simple log files into a Sankey diagram that shows real-life process flows (not just a theoretical diagram drawn by BAs and SMEs), like below:
Part 2 described how the logs are broken down into a design tree and how we can assign weightings to each branch based on the data stored in the logs, as below:
I’ve already had lots of great feedback in relation to the Part 1 blog, especially from people who’ve had challenges capturing as-is process. The feedback has been greatly appreciated so I’m looking forward to helping them draw up their flow-charts on the way to helping optimise their process flows.
But that’s just the starting point. Today’s post is where things get really exciting (for me at least). Today we build on part 2 and not just record weightings, but use them to assist future decisions.
We can use the decision tree to “predict forward” and help operators / algorithms make optimal decisions whilst working towards process completion. We can use a feedback loop to steer an operator (or application) down the most optimal branches of the tree (and/or avoid the fall-out variants).
This allows us to create a closed-loop, self-optimising, Decision Support System (DSS), as follows:
Using log data alone, we can perform decision optimisation based on “likelihood of success” or “time to complete” as per the weightings table. If supplemented with additional data, the weightings table could also allow decisions to be optimised by “cost to complete” or many other factors.
The model has the potential to be used in “real-time” mode, using the constant stream of process logs to continually refine and adapt. For example:
If the long-term average of a process path is 1 minute, but there’s currently a problem with it and that path is failing, then another path (one that is otherwise slightly less optimised over the long term) could be used until the first path is repaired
An operator happens to choose a new, more optimal path than has ever been identified previously (the delta function in the diagram). It then sets a new benchmark and informs the new approach via the DSS (Darwinian selection)
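The two examples above can be sketched in a few lines. The weightings table here is my own invented stand-in (the states, counts and timings are hypothetical). It shows branch selection by “likelihood of success” or “time to complete”, a temporary fallback when a path is failing, and the feedback step where a newly discovered path is folded back into the table.

```python
# Hypothetical weightings: (current_state, next_state) -> observed history.
WEIGHTINGS = {
    ("design", "auto_activate"): {"attempts": 200, "successes": 190, "avg_minutes": 1.0},
    ("design", "manual_activate"): {"attempts": 50, "successes": 49, "avg_minutes": 12.0},
    ("design", "fallout_queue"): {"attempts": 30, "successes": 9, "avg_minutes": 45.0},
}


def recommend(weightings, state, objective="likelihood_of_success", unavailable=()):
    """Suggest the next branch from a state, skipping currently failing paths."""
    candidates = {
        nxt: w
        for (cur, nxt), w in weightings.items()
        if cur == state and nxt not in unavailable
    }
    if not candidates:
        return None
    if objective == "likelihood_of_success":
        return max(candidates, key=lambda n: candidates[n]["successes"] / candidates[n]["attempts"])
    if objective == "time_to_complete":
        return min(candidates, key=lambda n: candidates[n]["avg_minutes"])
    raise ValueError(f"unknown objective {objective!r}")


def record_outcome(weightings, state, nxt, succeeded, minutes):
    """Feedback loop: fold one new log entry into the weightings table."""
    w = weightings.setdefault((state, nxt), {"attempts": 0, "successes": 0, "avg_minutes": 0.0})
    w["avg_minutes"] = (w["avg_minutes"] * w["attempts"] + minutes) / (w["attempts"] + 1)
    w["attempts"] += 1
    w["successes"] += int(succeeded)


best_by_success = recommend(WEIGHTINGS, "design")
best_by_time = recommend(WEIGHTINGS, "design", "time_to_complete")
# The fastest path is currently failing, so steer to the next best one.
fallback = recommend(WEIGHTINGS, "design", "time_to_complete", unavailable={"auto_activate"})
# An operator finds a new, faster path; it becomes the new benchmark.
record_outcome(WEIGHTINGS, "design", "api_activate", True, 0.5)
new_benchmark = recommend(WEIGHTINGS, "design", "time_to_complete")
print(best_by_success, best_by_time, fallback, new_benchmark)
```

A real implementation would need a minimum sample size before trusting a new path, since one lucky run shouldn’t displace a well-proven branch.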
If you’re wondering how the DSS could be implemented, I can envisage a few ways:
Using existing RPA (Robotic Process Automation) tools [which are particularly relevant if the workflow box in the diagram above crosses multiple different applications (not just a single monolithic OSS/BSS)]
Providing a feedback path into the functionality of the OSS/BSS and its GUI
Via notifications (eg email, Slack, etc) to operators
Via a simple, more manual process like flow diagrams, work instructions, scorecards or similar
This visualisation is exciting because it shows how your processes are actually flowing (or not), as opposed to the theoretical process diagrams that are laboriously created by BAs in conjunction with SMEs. It also shows which branches in the flow are actually being utilised and where inefficiencies are appearing (and are therefore optimisation targets).
Some people have wondered how simple activity logs can be used to show the Sankey diagrams. Hopefully the diagram below helps to describe this. You scan the log data looking for variants / patterns of flows and overlay those onto a map of decision states (DPs). In the diagram above, there are only 3 DPs, but 303 different variants (sounds implausible, but there are many variants that do multiple loops through the 3 states and are therefore considered to be a different variant).
The numbers / weightings you see on the Sankey diagram are the number* of instances (of a single flow type) that have transitioned between two DPs / states.
* Note that this is not the same as the count value that appears in the Weightings table. We’ll get to that in tomorrow’s post when we describe how to use the weightings data for decision support.
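For anyone wondering how the scan works mechanically, here is a minimal sketch. It assumes each log line can be reduced to an (instance, timestamp, state) tuple; the sample log is invented. Each instance’s ordered path through the states is one variant, and the Sankey link weights are counts of state-to-state transitions. In this sketch the transition counts are aggregated across all variants; keeping them per variant would only need the variant added to the counter key.

```python
from collections import Counter, defaultdict

# Hypothetical activity log: (instance_id, timestamp, state).
LOG = [
    ("order-1", 1, "DP1"), ("order-1", 2, "DP2"), ("order-1", 3, "DP3"),
    ("order-2", 1, "DP1"), ("order-2", 2, "DP2"), ("order-2", 3, "DP1"),
    ("order-2", 4, "DP2"), ("order-2", 5, "DP3"),
    ("order-3", 1, "DP1"), ("order-3", 2, "DP2"), ("order-3", 3, "DP3"),
]


def build_flows(log):
    """Return (variant counts, transition counts) from raw log tuples."""
    paths = defaultdict(list)
    for instance, _, state in sorted(log):  # sorts by instance, then timestamp
        paths[instance].append(state)
    # A variant is the full ordered sequence of states one instance visited.
    variants = Counter(tuple(path) for path in paths.values())
    # A transition is one hop between consecutive states (a Sankey link).
    transitions = Counter()
    for path in paths.values():
        transitions.update(zip(path, path[1:]))
    return variants, transitions


variants, transitions = build_flows(LOG)
print(variants)
print(transitions)
```

Note how order-2 loops back through DP1 and therefore counts as a different variant, even though it only ever touches the same three states.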
We contrasted this with the mechanisms used in most OSS that actually prevent flow-state from occurring. Today I’m going to dive into the work that goes into creating a new design (to activate a customer), and how our current OSS designs / processes inhibit flow.
“Being switched on at all times and expected to pick things up immediately makes us miserable, says [Cal] Newport. “It mismatches with the social circuits in our brain. It makes us feel bad that someone is waiting for us to reply to them. It makes us anxious.”
Because it is so easy to dash off a quick reply on email, Slack or other messaging apps, we feel guilty for not doing so, and there is an expectation that we will do it. This, says Newport, has greatly increased the number of things on people’s plates. “The average knowledge worker is responsible for more things than they were before email. This makes us frenetic. We should be thinking about how to remove the things on their plate, not giving people more to do…
Going cold turkey on email or Slack will only work if there is an alternative in place. Newport suggests, as many others now do, that physical communication is more effective. But the important thing is to encourage a culture where clear communication is the norm.
Newport is advocating for a more linear approach to workflows. People need to completely stop one task in order to fully transition their thought processes to the next one. However, this is hard when we are constantly seeing emails or being reminded about previous tasks. Some of our thoughts are still on the previous work – an effect called attention residue.”
That resonates completely with me. So let’s consider that and look into the collaboration process of a stylised order activation:
Customer places an order via an order-entry portal
Perform SQ (Service Qualification) and Credit Checks, automated processes
Order is broken into work order activities (automated process)
Designer1 picks up design work order activity from activity list and commences outside plant design (cables, pits, pipes). Her design pack includes:
Updating AutoCAD / GIS drawings to show outside plant (new cable in existing pit/pipe, plus lead-in cable)
Updating OSS to show splicing / patching changes
Creates project BoQ (bill of quantities) in a spreadsheet
Designer2 picks up next work order activity from activity list and commences active network design. His design pack includes:
Allocation of CPE (Customer Premises Equipment) from warehouse
Allocation of IP address from ranges available in IPAM (IP address manager)
Configuration plan for CPE and network edge devices
FieldWorkTeamLeader reviews inside plant and outside plant designs and allocates to FieldWorker1. FieldWorker1 is also issued with a printed design pack and the required materials
FieldWorker1 commences build activities and finds out there’s a problem with the design. It indicates splicing the customer lead-in to fibres 1/2, but they appear to already be in use
So, what does FieldWorker1 do next?
The activity list / queue process has worked reasonably well up until this step in the process. It allowed each person to work autonomously, stay in deep focus and in the sequence of their own choosing. But now, FieldWorker1 needs her issue resolved within only a few minutes or must move on to her next job (and next site). That would mean an additional truck-roll, but also annoying the customer who now has to re-schedule and take an additional day off work to open their house for the installer.
FieldWorker1 now needs to collaborate quickly with Designer1, Designer2 and FieldWorkTeamLeader. But most OSS simply don’t provide the tools to do so. The go-forward decision in our example draws upon information from multiple sources (ie AutoCAD drawing, GIS, spreadsheet, design document, IPAM and the OSS). Not only that, but the print-outs given to the field worker don’t reflect real-time changes in data. Nor do they give any up-stream context that might help her resolve this issue.
So FieldWorker1 contacts the designers directly (and separately) via phone.
Designer1 and Designer2 have to leave deep-think mode to respond urgently to the notification from FieldWorker1 and then take minutes to pull up the data. Designer1 and Designer2 have to contact each other about conflicting data sets. Too much time passes. FieldWorker1 moves to her next job.
Our challenge as OSS designers is to create a collaborative workspace that has real-time access to all data (not just the local context as the issue probably lies in data that’s upstream of what’s shown in the design pack). Our workspace must also provide all participants with the tools to engage visually and aurally – to choreograph head-office and on-site resources into “group flow” to resolve the issue.
Even if such tools existed today, the question I still have is how we ensure our designers aren’t interrupted from their all-important deep-think mode. How do we prevent them from having to drop everything multiple times a day/hour? Perhaps the answer is in an organisational structure – where all designers have to cycle through the Design Support function (eg 1 day in a fortnight), to take support calls from field workers and help them resolve issues. It will give designers a greater appreciation for problems occurring in the field and also help them avoid responding to emails, Slack messages, etc when in design mode.
Operators have developed many unique understandings of what impacts the health of their networks.
For example, mobile operators know that they have faster maintenance cycles in coastal areas than they do in warm, dry areas (yes, due to rust). Other operators have a high percentage of faults that are power-related. Others are impacted by failures caused by lightning strikes.
Near-real-time weather pattern and lightning strike data is now readily accessible, potentially for use by our OSS.
I was just speaking with one such operator last week who said, “We looked at it [using lightning strike data] but we ended up jumping at shadows most of the time. We actually started… looking for DSLAM alarms which will show us clumps of power failures and strikes, then we investigate those clumps and determine a cause. Sometimes we send out a single truck to collect artifacts, photos of lightning damage to cables, etc.”
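I don’t know how that operator actually identifies their clumps, but one simple way to sketch the idea is to group power-related alarms by area and look for bursts where consecutive alarms fall within a short gap of each other. The area names, the 10-minute gap and the minimum clump size of 3 below are all assumptions for illustration.

```python
def find_clumps(alarms, window_minutes=10, min_size=3):
    """Group (area, minute) alarms into per-area bursts worth investigating."""
    by_area = {}
    for area, minute in sorted(alarms):  # sorts by area, then time
        by_area.setdefault(area, []).append(minute)

    clumps = []
    for area, minutes in by_area.items():
        current = [minutes[0]]
        for minute in minutes[1:]:
            if minute - current[-1] <= window_minutes:
                current.append(minute)  # still part of the same burst
            else:
                if len(current) >= min_size:
                    clumps.append((area, current))
                current = [minute]
        if len(current) >= min_size:
            clumps.append((area, current))
    return clumps


# Hypothetical DSLAM power alarms: (exchange area, minutes since midnight).
ALARMS = [
    ("north", 0), ("north", 3), ("north", 7), ("south", 5), ("north", 120),
    ("south", 200), ("east", 50), ("east", 52), ("east", 58), ("east", 61),
]
clumps = find_clumps(ALARMS)
print(clumps)
```

Only the clumps would then be cross-checked against lightning strike or power feed data, rather than reacting to every strike.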
That discussion got me wondering about what other lateral approaches are used by operators to assure their networks. For example:
What external data sources do you use (eg meteorology, lightning strike, power feed data from power suppliers or sensors, sensor networks, etc)
Do you use it in proactive or reactive mode (eg to diagnose a fault or to use engineering techniques to prevent faults)
Have you built algorithms (eg root-cause, predictive maintenance, etc) to utilise your external data sources
If so, do those algorithms help establish automated closed-loop detect and response cycles
By measuring and managing, has it created quantifiable improvements in your network health
I’d love to hear about your clever and unique insight-generation ideas. Or even the ideas you’ve proposed that haven’t been built yet.
Let me start today with a question: Does your future OSS/BSS need to be drastically different to what it is today?
Please leave me a comment below, answering yes or no.
I’m going to take a guess that most OSS/BSS experts will answer yes to this question, that our future OSS/BSS will change significantly. It’s the reason I wrote the OSS Call for Innovation manifesto some time back. As great as our OSS/BSS are, there’s still so much need for improvement.
But big improvement needs big change. And big change is scary, as Tom Nolle points out:
“IT vendors, like most vendors, recognize that too much revolution doesn’t sell. You have to creep up on change, get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening.”
Do you feel like we’re already in the midst of a revolution? Cloud computing, web-scaling and virtualisation (of IT and networks) have been partly responsible for it. Agile and continuous integration/delivery models too.
The following diagram shows a “from the moon” level view of how I approach (almost) any new project.
The key to Tom’s quote above is in step 2. Just how far, or how ambitious, into the future are you projecting your required change? Do you even know what that future will look like? After all, the environment we’re operating within is changing so fast. That’s why Tom is suggesting that for many of us, step 2 is just a “creep up on it change.” The gap is essentially small.
The “creep up on it change” means just adding a few new relatively meaningless features at the end of the long tail of functionality. That’s because we’ve already had the most meaningful functionality in our OSS/BSS for decades (eg customer management, product / catalog management, service management, service activation, network / service health management, inventory / resource management, partner management, workforce management, etc). We’ve had the functionality, but that doesn’t mean we’ve perfected the cost or process efficiency of using it.
So let’s say we look at step 2 with a slightly different mindset. Let’s say we don’t try to add any new functionality. We lock that down to what we already have. Instead we do re-factoring and try to pull the efficiency levers, which means changes to:
Platforms (eg cloud computing, web-scaling and virtualisation as well as associated management applications)
Methodologies (eg Agile, DevOps, CI/CD, noting of course that they’re more than just methodologies, but also come with tools, etc)
Process (eg User Experience / User Interfaces [UX/UI], supply chain, business process re-invention, machine-led automations, etc)
It’s harder for most people to visualise what the Step 2 Future State looks like. And if it’s harder to envisage Step 2, how do we then move onto Steps 3 and 4 with confidence?
This is the challenge for OSS/BSS vendors, suppliers, integrators and implementers. How do we, “get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening?” And I should point out that it’s not just buyers we need to get disconnected from the comfortable past, but ourselves, myself definitely included.
I spent some time with a client going through their OSS/BSS yesterday. They’re an Australian telco with a primarily home-grown, browser-based OSS/BSS. One of its features was something I’ve never seen in an OSS/BSS before. But really quite subtle and cool.
They have four tiers of users:
Super-admins (the carrier’s in-house admins),
Standard (their in-house users),
Partners (they use many channel partners to sell their services),
Customer (the end-users of the carrier’s services).
All users have access to the same OSS/BSS, but just with different levels of functionality / visibility, of course.
Anyway, the feature that I thought was really cool was that the super-admins have access to what they call the masquerade function. It allows them to masquerade as any other user on the system without having to log-out / login to other accounts. This allows them to see exactly what each user is seeing and experience exactly what they’re experiencing (notwithstanding any platform or network access differences such as different browsers, response times, etc).
This is clearly helpful for issue resolution, but I feel it’s even more helpful for design, feature release and testing across different personas.
In my experience at least, OSS/BSS builders tend to focus on a primary persona (eg the end-user) and can overlook multi-persona design and testing. The masquerade function can make this task easier.
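To make the idea concrete, here is a minimal sketch of how a masquerade function could be modelled. It is not the client's implementation. The tier names come from the list above, but the `Session` shape, the feature names and the visibility rules are all hypothetical. The key design choice is that the session keeps both the real actor (for audit purposes) and the effective user (for rendering), so nobody has to log out and back in.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SUPER_ADMIN = "super-admin"
    STANDARD = "standard"
    PARTNER = "partner"
    CUSTOMER = "customer"


@dataclass(frozen=True)
class User:
    name: str
    tier: Tier


@dataclass(frozen=True)
class Session:
    actor: User      # who is really logged in (kept for audit trails)
    effective: User  # whose view of the system is being rendered


# Hypothetical visibility rules: which features each tier can see.
VISIBLE_FEATURES = {
    Tier.SUPER_ADMIN: {"orders", "inventory", "billing", "user-admin"},
    Tier.STANDARD: {"orders", "inventory", "billing"},
    Tier.PARTNER: {"orders", "billing"},
    Tier.CUSTOMER: {"orders"},
}


def login(user: User) -> Session:
    return Session(actor=user, effective=user)


def masquerade(session: Session, target: User) -> Session:
    """Return a session that renders the system as `target` sees it."""
    if session.actor.tier is not Tier.SUPER_ADMIN:
        raise PermissionError("only super-admins may masquerade")
    return Session(actor=session.actor, effective=target)


def visible_features(session: Session) -> set[str]:
    # Visibility follows the effective user, not the real actor.
    return VISIBLE_FEATURES[session.effective.tier]
```

A tester could then loop over one representative user per tier, masquerade as each, and compare what is visible against the intended design for that persona.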
There’s a famous Zig Ziglar quote that goes something like, “You can have everything in life you want, if you will just help enough other people get what they want.”
You could safely assume that this was written for the individual reader, but there is some truth in it within the OSS context too. For the OSS designer, builder, integrator, does the statement “You can have everything in your OSS you want, if you will just help enough other people get what they want,” apply?
We often just think about the O in OSS – Operations people, when looking for who to help. But OSS/BSS has the ability to impact far wider than just the Ops team/s.
The halcyon days of OSS were probably in the 1990s to early 2000s when the term OSS/BSS was at its most sexy and exciting. The big telcos were excitedly spending hundreds of millions of dollars. Those projects were huge… and hugely complex… and hugely fun!
With that level of investment, there was the expectation that the OSS/BSS would help many people. And they did. But the lustre has come off somewhat since then. We’ve helped sooooo many people, but perhaps didn’t help enough people enough. Just speak with anybody involved with an OSS/BSS stack and you’ll hear hints of a large gap that exists between their current state and a desired future state.
Do you mind if I ask two questions?
When you reflect on your OSS activities, do you focus on the technology, the opportunities or the problems?
Do you look at the local, day-to-day activities or the broader industry?
I tend to find myself focusing on the problems: how to solve them within the daily context of customer challenges, and the broader industry problems when I take the time to reflect, such as when writing these blogs.
The part I find interesting is that we still face most of the same problems today that we did back in the 1990s-2000s. The same sources of risk. We’ve done a fantastic job of helping many people get what they want on their day-to-day activities (the incremental). We still haven’t cracked the big challenges though. That’s why I wrote the OSS Call for Innovation, to articulate what lies ahead of us.
It’s why I’m really excited about two of the concepts we’ve discussed this week:
As the title suggests above, NaaS has the potential to be as big a paradigm shift for networks (and OSS/BSS) as Agile has been for software development.
There are many facets to the Agile story, but for me one of the most important aspects is that it has taken end-to-end (E2E), monolithic thinking and has modularised it. Agile has broken software down into pieces that can be worked on by smaller, more autonomous teams than the methods used prior to it.
The same monolithic, E2E approach pervades the network space currently. If a network operator wants to add a new network type or a new product type/bundle, large project teams must be stood up. And these project teams must tackle E2E complexity, especially across an IT stack that is already a spaghetti of interactions.
But before I dive into the merits of NaaS, let me take you back a few steps, back into the past. Actually, for many operators, it’s not the past, but the current-day model.
As per the orange arrow, customers of all types (Retail, Enterprise and Wholesale) interact with their network operator through BSS (and possibly OSS) tools. [As an aside, see this recent post for a “religious war” discussion on where BSS ends and OSS begins]. The customer engagement occurs (sometimes directly, sometimes indirectly) via BSS tools such as:
Order Entry, Order Management
Product Catalog (Product / Offer Management)
SLA (Service Level Agreement) Management
If the customer wants a new instance of an existing service, then all’s good with the current paradigm. Where things become more challenging is when significant changes occur (as reflected by the yellow arrows in the diagram above).
For example, if any of the following are introduced, there are end-to-end impacts. They necessitate E2E changes to the IT spaghetti and require formation of a project team that includes multiple business units (eg products, marketing, IT, networks, change management to support all the workers impacted by system/process change, etc):
A new product or product bundle is to be taken to market
An end-customer needs a custom offering (especially in the case of managed service offerings for large corporate / government customers)
A new network type is added into the network
System and / or process transformations occur in the IT stack
If we just narrow in on point 3 above, fundamental changes are happening in network technology stacks already. Network virtualisation (SDN/NFV) and 5G are currently generating large investments of time and money. They’re fundamental changes because they also change the shape of our traditional OSS/BSS/IT stacks, as follows.
We now not only have Physical Network Functions (PNF) to manage, but Virtual Network Functions (VNF) as well. In fact it now becomes even more difficult because our IT stacks need to handle PNF and VNF concurrently. Each has their own nuances in terms of over-arching management.
The virtualisation of networks and application infrastructure means that our OSS see greater southbound abstraction. Greater southbound abstraction means we potentially lose E2E visibility of physical infrastructure. Yet we still need to manage E2E change to IT stacks for new products, network types, etc.
The diagram below shows how NaaS changes the paradigm. It de-couples the network service offerings from the network itself. Customer Facing Services (CFS) [as presented by BSS/OSS/NaaS] are de-coupled from Resource Facing Services (RFS) [as presented by the network / domains].
NaaS becomes a “meet-in-the-middle” tool. It effectively de-couples
The products / marketing teams (who generate customer offerings / bundles) from
The networks / operations teams (who design, build and maintain the network) and
The IT teams (who design, build and maintain the IT stack)
It allows product teams to be highly creative with their CFS offerings from the available RFS building blocks. Consider it like Lego. The network / ops teams create the building blocks and the products / marketing teams have huge scope for innovation. The products / marketing teams rarely need to ask for custom building blocks to be made.
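The Lego analogy can be sketched in a few lines of code. This is an illustration only, with hypothetical class names, service names and domains. It does not follow any particular catalog standard. The network / ops teams publish RFS building blocks, the products / marketing teams define CFS offerings purely by referencing those blocks, and an order is decomposed back into per-domain work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RFS:
    """Resource Facing Service: a building block published by a network domain."""
    name: str
    domain: str


@dataclass(frozen=True)
class CFS:
    """Customer Facing Service: an offering assembled from RFS building blocks."""
    name: str
    blocks: tuple[str, ...]


class ServiceCatalog:
    def __init__(self) -> None:
        self._rfs: dict[str, RFS] = {}
        self._cfs: dict[str, CFS] = {}

    def publish_rfs(self, rfs: RFS) -> None:
        # Done by the network / ops teams.
        self._rfs[rfs.name] = rfs

    def define_cfs(self, cfs: CFS) -> None:
        # Done by the products / marketing teams, using only published blocks.
        missing = [b for b in cfs.blocks if b not in self._rfs]
        if missing:
            raise KeyError(f"no such building blocks: {missing}")
        self._cfs[cfs.name] = cfs

    def decompose(self, cfs_name: str) -> list[RFS]:
        """Resolve a customer offering into its resource-level building blocks."""
        return [self._rfs[b] for b in self._cfs[cfs_name].blocks]


catalog = ServiceCatalog()
catalog.publish_rfs(RFS("access-circuit", domain="fixed-access"))
catalog.publish_rfs(RFS("virtual-firewall", domain="virtualised"))
catalog.define_cfs(CFS("secure-branch", blocks=("access-circuit", "virtual-firewall")))
```

Note that defining a new bundle such as `secure-branch` touches only the catalog. No change is needed in either network domain, which is the de-coupling being described.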
You’ll notice that the entire stack shown in the diagram below is far more modular than the diagram above. Being modular makes the network stack more suited to being worked on by smaller autonomous teams. The yellow arrows indicate that modularity, both in terms of the IT stack and in terms of the teams that need to be stood up to make changes. Hence my claim that NaaS is to networks what Agile has been to software.
You will have also noted that NaaS allows the Network / Resource part of this stack to be broken into entirely separate network domains. Separation in terms of IT stacks, management and autonomy. It also allows new domains to be stood up independently, which accommodates the newer virtualised network domains (and their VNFs) as well as platforms such as ONAP.
The NaaS layer comprises:
A TMF standards-based API Gateway
A Master Services Catalog
A common / consistent framework of presentation of all domains
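As a rough sketch of how those three pieces could fit together, the code below gives every domain the same adapter shape (the common framework), aggregates what each domain offers (the master catalog) and routes requests through a single entry point (the gateway). All class, method and service names are hypothetical, and the interface is deliberately simplified. It does not reproduce any specific TM Forum API.

```python
from typing import Protocol


class DomainAdapter(Protocol):
    """The common shape every network domain presents to the NaaS layer."""

    def list_services(self) -> list[str]: ...

    def activate(self, service: str, params: dict) -> str: ...


class TransportDomain:
    def list_services(self) -> list[str]:
        return ["ethernet-line"]

    def activate(self, service: str, params: dict) -> str:
        return f"transport:{service}:{params['site']}"


class VirtualisedDomain:
    def list_services(self) -> list[str]:
        return ["virtual-firewall"]

    def activate(self, service: str, params: dict) -> str:
        return f"virtual:{service}:{params['site']}"


class NaasGateway:
    def __init__(self) -> None:
        self._domains: dict[str, DomainAdapter] = {}

    def register(self, name: str, adapter: DomainAdapter) -> None:
        # New domains can be stood up independently and plugged in here.
        self._domains[name] = adapter

    def master_catalog(self) -> dict[str, list[str]]:
        return {name: d.list_services() for name, d in self._domains.items()}

    def activate(self, domain: str, service: str, params: dict) -> str:
        adapter = self._domains[domain]
        if service not in adapter.list_services():
            raise ValueError(f"{service!r} is not offered by {domain!r}")
        return adapter.activate(service, params)
```

Because callers only ever see `NaasGateway`, a domain can be replaced or added without the consumers of the catalog needing to change.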
The ramifications of this excite me even more than what’s shown in the diagram above. By offering access to the network via APIs and as a catalog of services, it allows a large developer pool to provide innovative offerings to end customers (as shown in the green box below). It opens up the long tail of innovation that we discussed last week.
Some telcos will open up their NaaS to internal or partner developers. Others are drooling at the prospect of offering network APIs for consumption by the market.
You’ve probably already identified this, but the awesome thing for the developer community is that they can combine services/APIs not just from the telcos but any other third-party providers (eg Netflix, Amazon, Facebook, etc, etc, etc). I could’ve shown these as East-West services in the diagram but decided to keep it simpler.
Developers are not constrained to offering communications services. They can now create / offer higher-order services that also happen to have communications requirements.
If you weren’t already on board with the concept, hopefully this article has convinced you that NaaS will be to networks what Agile has been to software.
Agree or disagree? Leave me a comment below.
PS1. I’ve used the old TMN pyramid as the basis of the diagram to tie the discussion to legacy solutions, not to imply size or emphasis of any of the layers.
PS3. Similarly, the size of the NaaS layer is to bring attention to it rather than to imply it is a monolithic stack in its own right. In reality, it is actually a much thinner shim layer architecturally.
PS4. The analogy between NaaS and Agile is to show similarities, not to imply that NaaS replaces Agile. They can definitely be used together.
PS5. I’ve used the term IT quite generically (operationally and technically) just to keep the diagram and discussion as simple as possible. In reality, there are many sub-functions like data centre operations, application monitoring, application control, applications development, product owner, etc. These are split differently at each operator.
Back in the earliest days of OSS (and networks for that matter), it was the telcos that generated almost all of the innovation. That effectively limited innovation to being developed by the privileged few, those who worked for the government-owned, monopoly telcos.
But over time, the financial leaders at those telcos felt the costs of their amazing research and development labs outweighed the benefits and shut them down (or starved them at best). OSS (and network) vendors stepped into the void to assume responsibility for most of the innovation. But there was a dilemma for the vendors (and for telcos and consumers too) – they needed to innovate fast enough to win work against their competitors, but slow enough to accrue revenues from the investment in their earlier innovations. And innovation was still being constrained to the privileged few, those who worked for vendors and integrators.
Now, the telcos are increasingly pushing to innovate wider and faster than the current vendor collective can accommodate. It means we have to reach further out to the long-tail of innovators. To open the floor beyond the privileged few. Excitingly, this opportunity appears to be looming.
“How?” you may ask.
Network as a Service (NaaS) and API platform offerings.
If every telco offers consumption of their infrastructure via API, it provides the opportunity for any developer to bundle their own unique offering of products, services, applications, hosting, etc and take it to market. If you’re heading to TM Forum’s Digital Transformation World (DTW) in Nice next week, there are a number of Catalyst projects on display in this space, including:
The challenge for the telcos is in how to support the growth of this model. To foster the vendor market, it was easy enough for the telcos to identify the big suppliers and funnel projects (and funding) through them. But now they have to figure out a funnel that’s segmented at a much smaller scale – to facilitate take-up by the millions of developers globally who might consume their products (network APIs in this case) rather than the hundreds/thousands of large suppliers.
This brings us back to smart contracts and micro-procurement as well as the technologies such as blockchain that support these models. This ties in with another TM Forum initiative to revolutionise the procurement event:
But an additional benefit for the telcos, if and when the NaaS platform model takes hold, is that the developers also become an unpaid salesforce for the telcos. The developers will be responsible for marketing and selling their own bundles, which will drive consumption and revenues on the telcos’ assets.
Exciting new business models and supply chains are bound to evolve out of this long tail of innovation.
All OSS products are excellent these days. And all OSS vendors know what the most important functionality is. They already have those features built into their products. That is, they’ve already added the all-important features at the left side of the graph.
But it also means product teams are tending to only add the relatively unimportant new features to the right edge of the graph (ie inside the red box). Relatively unimportant and therefore delivering minimal differential advantage.
The challenge for users is that there is a huge amount of relatively worthless functionality that they have to navigate around. This tends to make the user interfaces non-intuitive.
But another approach, a product-led differentiator, dawned on me when discussing the many sources of OSS friction in yesterday’s post. What if we asked our product teams to take a focus on designing solutions that remove friction instead of the typical approach of adding features (and complexity)?
Almost every OSS I’m aware of has many areas of friction. It’s what gives the OSS industry a bad name. But what if one vendor reduced friction to levels far lower than any other competitor? Would it be a differentiator? I’m quite certain customers would be lining up to buy a frictionless OSS even if it didn’t have every conceivable feature.
Well, would you hire a furniture maker as CEO of an OSS vendor?
At face value, it would seem to be an odd selection, right? There doesn’t seem to be much commonality between furniture and OSS, does there? It seems as likely as hiring a furniture maker to be CEO of a car maker.
Oh wait. That did happen.
Ford Motor Company made just such a decision last year when appointing Jim Hackett, a furniture industry veteran, as its CEO. Whether the appointment proves successful or not, it’s interesting that Ford made the decision. But why? To focus on user experience and design as its next big differentiator. Clever line of thinking, Bill Ford!!
I’ve prepared a slightly light-hearted table for comparison purposes between cars and OSS. Both are worth comparing as they’re both complex feats of human engineering:
Transport passengers between destinations
Operationalise and monetise a comms network
Claimed “Business” justification
Reducing the cost of operations
Operation of common functionality without conscious thought (developed through years of operator practice)
Hmmm??? Depends on which sales person or operator you speak with
Error detection and current-state monitoring
Warning lights and instrument cluster/s
Alarm lists, performance graphs
Key differentiator for customers (1970’s)
Database / CPU size
Key differentiator for customers (2000’s)
Gadgets / functions / cup-holders
Key differentiator for customers (2020+)
Connected car (car as an “experience platform”)
Connected OSS (ie OSS as an experience platform)???
I’d like to focus on three key areas next:
Item 3,
Item 4 and
The transition between items 6 and 7
Item 3 – operating on auto-pilot
If we reference against item 1, the primary objective, experienced operators of cars can navigate from point A to point B with little conscious thought. Key activities such as steering, changing gears and indicating can be done almost as a background task by our brains whilst doing other mental processing (talking, thinking, listening to podcasts, etc).
Experienced operators of OSS can do primary objectives quickly, but probably not on auto-pilot. There are too many “levers” to pull, too many decisions to make, too many options to choose from, for operators to background-process key OSS activities. The question is, could we re-architect to achieve key objectives more as background processing tasks?
Item 4 – error detection and monitoring
In a car, error detection is also a background task, where operators are rarely notified, only for critical alerts (eg engine light, fuel tank empty, etc). In an OSS, error detection is not a background task. We need full-time staff monitoring all the alarms and alerts popping up on our consoles! Sometimes they scroll off the page too fast for us to even contemplate.
In a car, monitoring is kept to the bare essentials (speedo, tacho, fuel gauge, etc). In an OSS, we tend to be great at information overload – we have a billion graphs and are never sure which ones, or which thresholds, actually allow us to operate our “vehicle” effectively. So we show them all.
Transitioning from current to future-state differentiators
In cars, we’ve finally reached peak-cup-holders. Manufacturers know they can no longer differentiate from competitors just by having more cup-holders (at least, I think this claim is true). They’ve also realised that even entry-level cars have an astounding list of features that are only supplementary to the primary objective (see item 1). They now know it’s not the amount of functionality, but how seamlessly and intuitively the users interact with the vehicle on end-to-end tasks. The car is now seen as an extension of the user’s phone rather than vice versa, unlike the recent past.
In OSS, I’ve yet to see a single cup holder (apart from the old gag about CD trays). Vendors mark that down – cup holders could be a good differentiator. But seriously, I’m not sure if we realise the OSS arms race of features is no longer the differentiator. Intuitive end-to-end user experience can be a huge differentiator amongst the sea of complex designs, user interfaces and processes available currently. But nobody seems to be talking about this. Go to any OSS event and we only hear from engineers talking about features. Where are the UX experts talking about innovative new ways for users to interact with machines to achieve primary objectives (see item 1)?
But a functionality arms race isn’t a completely dead differentiator. In cars, there is a horizon of next-level features that can be true differentiators like self-driving or hover-cars. Likewise in OSS, incremental functionality increases aren’t differentiators. However, any vendor that can not just discuss, but can produce next-level capabilities like zero touch assurance (ZTA) and automated O2A (Order to Activate) will definitely hold a competitive advantage.
OSS can be cumbersome at times. Making change can be difficult. We tend to build layers of protections around them and the networks we manage. I get that. Change can be risky (although the protections are often implemented because the OSS and/or network platforms might not be as robust as they could be).
Contrast this with the OSS we want to create. We want to create a platform for rapid innovation, the platform that helps us and our clients generate opportunities and advantages.
For us to build a platform that allows our customers (and their customers) to revolutionise their markets, we might have to consider whether the protective layers around our OSS are stymying change. Things like firewall burns, change review boards, documentation, approvals, politics, individuals with a reticence to change, etc.
For example, Netflix takes a contrarian, whitelist approach to access by its engineers rather than a blacklist. It assumes that its engineers are professional enough to only use the tools that they need to get their tasks done. They enable their engineers to use commonly off-limits functionality such as adding their own DNS records (ie to support the stand-up of new infrastructure). But they also take a use-it-or-lose-it approach, monitoring the tools that the engineer uses and rescinding access to tools they haven’t used within 90 days. But if they do need access again, it’s as simple as a message on Slack to reinstate it.
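The use-it-or-lose-it idea is simple enough to sketch. The code below is not Netflix's implementation, which isn't described here. It is a minimal illustration with hypothetical names, assuming access is tracked per engineer and per tool, that a periodic job rescinds anything unused for more than 90 days, and that reinstatement is just another grant.

```python
from datetime import datetime, timedelta

UNUSED_LIMIT = timedelta(days=90)  # the 90-day window mentioned above


class ToolAccess:
    def __init__(self) -> None:
        # (engineer, tool) -> time the tool was last used (or granted)
        self._last_used: dict[tuple[str, str], datetime] = {}

    def grant(self, engineer: str, tool: str, now: datetime) -> None:
        # Also used to reinstate access after it has been rescinded.
        self._last_used[(engineer, tool)] = now

    def has_access(self, engineer: str, tool: str) -> bool:
        return (engineer, tool) in self._last_used

    def record_use(self, engineer: str, tool: str, now: datetime) -> None:
        if not self.has_access(engineer, tool):
            raise PermissionError(f"{engineer} has no access to {tool}")
        self._last_used[(engineer, tool)] = now

    def rescind_unused(self, now: datetime) -> list[tuple[str, str]]:
        """Remove and return every grant not used within the limit."""
        stale = [
            key for key, last in self._last_used.items()
            if now - last > UNUSED_LIMIT
        ]
        for key in stale:
            del self._last_used[key]
        return stale
```

The point of the design is that the default is trust: nothing is blocked up front, and the only ongoing cost to an engineer is an occasional request to reinstate a tool they had stopped using.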
This is just one small example of streamlining the platform wrapper. There are probably a million others.
When working on OSS projects as the integrator / installer, I’ve seen many of these “platform wrapper” roadblocks. I’m sure you have too. If you see them as the installer, chances are the ops team you hand over to will also experience these roadblocks.
Question though. Do you flag these platform wrapper roadblocks for improvement, or do you treat them as non-platform and therefore just live with them?
“There’s a fable of a man stuck in a flood. Convinced that God is going to save him, he says no to a passing canoe, boat, and helicopter that offer to help. He dies, and in heaven asks God why He didn’t save him. God says, “I sent you a canoe, a boat, and a helicopter!”
We all have vivid imaginations. We get a goal in our mind and picture the path so clearly. Then it’s hard to stop focusing on that vivid image, to see what else could work.
New technologies make old things easier, and new things possible. That’s why you need to re-evaluate your old dreams to see if new means have come along.”
Derek Sivers, here.
In the past, we could make OSS platform decisions with reasonable confidence that our choices would remain viable for many years. For example, in the 1990s if we decided to build our OSS around a particular brand of relational database then it probably remained a valid choice until after 2010.
But today, there are so many more platforms to choose from, not to mention the technologies that underpin them. And it’s not just the choices currently available but the speed with which new technologies are disrupting the existing tech. In the 1990s, it was a safe bet to use AutoCAD for outside plant visualisation without the risk of heavy re-tooling within a short timeframe.
If making the same decision today, the choices are far less clear-cut. And the risk that your choice will be obsolete within a year or two has skyrocketed.
With the proliferation of open-source projects, the decision has become harder again. That means the skill-base required to service each project has also spread thinner. In turn, decisions for big investments like OSS projects are based more on the critical mass of developers than the functionality available today. If many organisations and individuals have bought into a particular project, you’re more likely to get your new features developed than from a better open-source project that has less community buy-in.
We end up with two ends of a continuum to choose between. We can either chase every new bright shiny object and re-factor for each, or we can plan a course of action and stick to it even if it becomes increasingly obsoleted over time. The reality is that we probably fit somewhere between the two ends of the spectrum.
To be brutally honest I don’t have a solution to this conundrum. The closest technique I can suggest is to design your solution with modularity in mind, as opposed to the monolithic OSS of the past. That’s the small-grid OSS architecture model. It’s easier to replace one building than an entire city.