Would you ever alarm your lab equipment?

Something curious dawned on me the other day – how many people / organisations actively manage the alarms / alerts generated by their lab equipment?

At first glance, this would seem silly. Lab environments are in constant flux, in all sorts of semi-configured situations, and therefore likely to be alarming their heads off at different times.

As such, it would seem even sillier to send alarms, syslogs, performance counters, etc from your lab boxes through to your production alarm management platform. Events on your lab equipment simply shouldn’t be distracting your NOC teams from handling events on your active network, right?

But here’s why I’m curious – in a word, DevOps. Does the DevOps model now mean that some lab equipment stays in a relatively stable state and performs near-mission-critical activities like automated regression testing? Therefore, does it make sense to selectively monitor / manage some lab equipment?

In what other scenarios might we wish to send lab alarms to the NOC (and I’m not talking about system integration testing type scenarios, but ongoing operational scenarios)?
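To make the idea a little more concrete, here’s a minimal sketch (in Python, with entirely invented device names, tags and thresholds) of what selective forwarding might look like – only alarms from lab devices tagged as performing ongoing operational roles, such as automated regression rigs, get passed through to the production alarm manager.

```python
# Hypothetical sketch: forward only alarms from "operationally significant" lab devices.
# Device names, tags and severities are invented for illustration.

LAB_DEVICE_TAGS = {
    "lab-rtr-01": {"regression-rig"},      # stable, runs automated regression tests
    "lab-rtr-02": set(),                   # constantly re-configured sandbox
    "lab-olt-07": {"regression-rig", "demo"},
}

FORWARD_TAGS = {"regression-rig"}          # only these lab roles reach the NOC
MIN_SEVERITY = 3                           # 1=info ... 5=critical

def forward_to_noc(alarm: dict) -> bool:
    """Return True if a lab alarm should be forwarded to the production alarm manager."""
    tags = LAB_DEVICE_TAGS.get(alarm["device"], set())
    return bool(tags & FORWARD_TAGS) and alarm["severity"] >= MIN_SEVERITY

alarms = [
    {"device": "lab-rtr-01", "severity": 4, "text": "regression suite lost mgmt reachability"},
    {"device": "lab-rtr-02", "severity": 5, "text": "interface down (expected - being re-cabled)"},
]

for a in alarms:
    print(a["device"], "->", "NOC" if forward_to_noc(a) else "suppressed")
```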

Using OSS machine learning to predict backwards not forwards

There’s a lot of excitement about what machine-led decisioning can introduce into the world of network operations, and rightly so. Excitement about predictions, automation, efficiency, optimisation, zero-touch assurance, etc.

There are so many use-cases that disruptors are proposing to solve using Artificial Intelligence (AI), Machine Learning (ML) and the like. I might have even been guilty of proposing a few ideas here on the PAOSS blog – ideas like closed-loop OSS learning / optimisation of common processes, the use of Robotic Process Automation (RPA) to reduce swivel-chairing and a few more.

But have you ever noticed that most of these use-cases are forward-looking? That is, using past data to look into the future (eg predictions, improvements, etc). What if we took a slightly different approach and used past data to reconcile what we’ve just done?

AI and ML models tend to rely on lots of consistent data. OSS produce and collect lots of data, so we get a big tick on volume. The only problem is that much of our manually created OSS data is riddled with inconsistency, due to nuances in the way each worker performs workflows, data entry, etc, as well as our own personal biases and perceptions.

For example, zero-touch automation has significant cachet at the moment. To achieve that, we need network health indicators (eg alarms, logs, performance metrics), but also a record of interventions that have (hopefully) improved on those indicators. If we have inconsistent or erroneous trouble-ticketing information, our AI is going to struggle to figure out the right response to automate. Network operations teams tend to want to fix problems quickly, get the ticketing data populated as fast as possible and move on to the next fault, even if that means creating inconsistent ticketing data.

So, to get back to looking backwards, a machine-learning use-case to consider today is to look at what an operator has solved, compare it to past resolutions and either validate the accuracy of their resolution data or even auto-populate fields based on the operator’s actions.
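As a rough illustration only (the fault types, fields and threshold below are invented, and a real implementation would use far richer similarity models), here’s a sketch of that backwards-looking validation – comparing a just-entered resolution against past resolutions for the same fault type:

```python
from difflib import SequenceMatcher

# Hypothetical historical resolutions, keyed by fault type (field names are invented).
PAST_RESOLUTIONS = {
    "LOS on access link": ["replaced faulty SFP", "re-seated patch lead", "replaced faulty SFP"],
}

def validate_resolution(fault_type: str, entered_resolution: str, threshold: float = 0.6):
    """Compare a just-entered resolution against past resolutions for the same fault type.
    Returns the closest historical match and whether the entry looks consistent with history."""
    history = PAST_RESOLUTIONS.get(fault_type, [])
    scored = [(SequenceMatcher(None, entered_resolution.lower(), past.lower()).ratio(), past)
              for past in history]
    if not scored:
        return None, False
    best_score, best_match = max(scored)
    return best_match, best_score >= threshold

match, consistent = validate_resolution("LOS on access link", "swapped SFP")
print(f"closest past resolution: {match!r}, looks consistent: {consistent}")
```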

Hat tip to Jay Fenton for the idea seed for this post.

OSS / BSS security getting a little cloudy

“Many systems are moving beyond simple virtualization and are being run on dynamic private or even public clouds. CSPs will migrate many to hybrid clouds because of concerns about data security and regulations on where data are stored and processed.
We believe that over the next 15 years, nearly all software systems will migrate to clouds provided by third parties and be whatever cloud native becomes when it matures. They will incorporate many open-source tools and middleware packages, and may include some major open-source platforms or sub-systems (for example, the size of OpenStack or ONAP today).”
Dr Mark H Mortensen
in an article entitled, “BSS and OSS are moving to the cloud: Analysys Mason” on Telecom Asia.

Dr Mortensen raises a number of other points relating to cloud models for OSS and BSS in the article linked above, including definitions of various cloud / virtualisation related terms.

He also rightly points out that many OSS / BSS vendors are seeking to move to virtualised / cloud / as-a-Service delivery models (for reasons including maintainability, scalability, repeatability and other “ilities”).

The part that I find interesting with cloud models (I’ll use the term generically) is the positioning of the security control point(s). Let’s start by assuming a scenario where:

  1. The Active Network (AN) is “on-net” – the network that carries live customer traffic (the routers, switches, muxes, etc) is managed by the CSP / operator [noting, though, that these too may be managed as virtual entities rather than owned].
  2. The “cloud” OSS/BSS is “off-net” – some vendors will insist on their multi-tenanted OSS/BSS existing within the public cloud.

The diagram below shows three separate realms:

  1. The OSS/BSS “in the cloud”
  2. The operator’s enterprise / DC realm
  3. The operator’s active network realm

as well as the Security Control Points (SCPs) between them.

OSS BSS Cloud Security Control Points

The most important consideration of this architecture is that the Active Network remains operational (ie carries customer traffic) even if the link to the DC and/or the link to the cloud is lost.

With that in mind, our second consideration is what aspects of network management need to reside within the AN realm. It’s not just the Active Network devices, but anything else that allows the AN to operate in an isolated state. This means that shared services like NTP / synch need a presence in the AN realm (even if not of the highest stratum within the operator’s time-synch solution).

What about Element Managers (EMS) that look after the AN devices? How about collectors / probes? How about telemetry data stores? How about network health management tools like alarm and performance management? How about user access management (LDAP, AD, IAM, etc)? Do they exist in the AN or DC realm?

Then if we step up the stack a little to what I refer to as East-West OSS / BSS tools like ticket management, workforce management, even inventory management – do we collect, process, store and manage these within the DC or are we prepared to shift any of this functionality / data out to the cloud? Or do we prefer it to remain in the AN realm and ensure only AN privileged users have access?

Which OSS / BSS tools remain on-net (perhaps as private cloud) and which can (or must) be managed off-net (public cloud)?
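One way to reason about it is to simply catalogue where each management function sits and check the isolation rule against that catalogue. The sketch below is illustrative only – the placements are examples, not recommendations:

```python
# Illustrative only: which realm does each management function live in?
# The placements below are examples, not recommendations.

PLACEMENT = {
    "NTP / synch (local stratum)": "AN",
    "EMS for AN devices":          "AN",
    "Alarm collectors / probes":   "AN",
    "Telemetry data store":        "DC",
    "Fault & performance mgmt":    "DC",
    "IAM / LDAP / AD":             "DC",
    "Ticketing / workforce mgmt":  "CLOUD",
    "Inventory management":        "CLOUD",
}

# Functions the AN needs in order to keep carrying traffic while isolated.
REQUIRED_IN_AN = {"NTP / synch (local stratum)", "EMS for AN devices"}

missing = [f for f in REQUIRED_IN_AN if PLACEMENT.get(f) != "AN"]
print("AN can operate in isolation" if not missing else f"gap: {missing} not in AN realm")
```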

Climb further up the stack and we get into the interesting part of cloud offerings. Not only do we potentially have the OSS/BSS (including East-West tools), but more excitingly, we could bring in services or solutions like content from external providers and bundle them with our own offerings.

We often hear about the tight security that’s expected (and offered) as part of vendor OSS/BSS cloud solutions, but as you can see, the tougher consideration for network management architects is actually the end-to-end security and where to position the security control points relative to all the pieces of the OSS/BSS stack.

A new phenomenon for IT

In the past, business-oriented groups have had ideas about what they want to do and then they come to us… Now, they want to know what technology can bring to the table and then they’ll work on the business plan.
So there’s a big gap here. It’s a phenomenon that’s been happening in the last year and it’s an uncomfortable place for IT. We’re not used to having to lead in that way. We have been more in the order-taker business.

Veenod Kurup, Liberty Global, Group CIO.

That’s a really thought-provoking insight from Veenod isn’t it? Technology driving the business rather than business driving the technology. Technology as the business advantage.

I’ll be honest here – I never thought I’d see the day, although I… guess… as e-business increases, the dependency on tech increases in lockstep. I’m passionate about tech, but also of the opinion that the tech is only a means to an end.

So if what Veenod says is reflective of a macro-trend across all industry then he’s right in saying that we’re going to have some very uncomfortable situations for some tech experts. Many will have to widen their field of view from a tech-only vision to a business vision.

Maybe instead the business-oriented groups could just come to the OSS / BSS department and speak with our valuable tripods. After all, we own and run the information and systems where business (BSS) meets technology (OSS) right? Or as previously reiterated on this blog, we have the opportunity to take the initiative and demonstrate the business value within our OSS / BSS.

PS. hat-tip to Dawn Bushaus for unearthing the quote from Veenod in her article on Inform.

Dematerialisation of OSS

In 1972, the Club of Rome in its report The Limits to Growth predicted a steadily increasing demand for material as both economies and populations grew. The report predicted that continually increasing resource demand would eventually lead to an abrupt economic collapse. Studies on material use and economic growth show instead that society is gaining the same economic growth with much less physical material required. Between 1977 and 2001, the amount of material required to meet all needs of Americans fell from 1.18 trillion pounds to 1.08 trillion pounds, even though the country’s population increased by 55 million people. Al Gore similarly noted in 1999 that since 1949, while the economy tripled, the weight of goods produced did not change.
Wikipedia on the topic of Dematerialisation.

The weight of OSS transaction volumes appears to be increasing year on year as we add more stuff to our OSS. The touchpoint explosion is amplifying this further. Luckily, our platforms / middleware, compute, networks and storage have all been scaling as well, so the increased weight has not been as noticeable as it might have been (even though we’ve all worked on OSS that have been buckling under the weight of transaction volumes, right?).

Does it also make sense that when there is an incremental cost per transaction (eg via the increasingly prevalent cloud or “as a service” offerings), we pay closer attention to transaction volumes because there is a greater perception of cost to us, but not for “internal” transactions where there is little perceived incremental cost?

But it’s not so much the transaction processing volumes that are the problem directly. It’s more by implication. For each additional transaction there’s the risk of a hand-off being missed or mis-mapped or slowing down overall activity processing times. For each additional transaction type, there’s additional mapping, testing and regression testing effort as well as an increased risk of things going wrong.

Do you measure transaction flow volumes across your entire OSS suite? Does it provide an indication of where efficiency optimisation (ie dematerialisation) could occur and guide your re-factoring investments? Does it guide you on process optimisation efforts?
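If you don’t measure them today, even a crude aggregation of hand-off counts can be revealing. Here’s a tiny sketch (with made-up hand-offs and counts) of the kind of roll-up I mean:

```python
from collections import Counter

# Made-up hand-off log: (source system, destination system) per transaction.
handoffs = [
    ("CRM", "Order Mgmt"), ("Order Mgmt", "Inventory"), ("Order Mgmt", "Inventory"),
    ("Inventory", "Activation"), ("Order Mgmt", "Billing"), ("Order Mgmt", "Inventory"),
]

# Count volumes per hand-off to highlight where dematerialisation effort might pay off.
volumes = Counter(handoffs)
for (src, dst), count in volumes.most_common():
    print(f"{src:>10} -> {dst:<12} {count} transactions")
```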

An OSS automation mind-flip

I recently had something of a perspective-flip moment in relation to automation within the realm of OSS.

In the past, I’ve tended to tackle the automation challenge from the perspective of applying automated / scripted responses to tasks that are done manually via the OSS. But it’s dawned on me that I have it around the wrong way! That’s only an incremental perspective on the main objective of automation – global zero-touch networks.

If we take all of the tasks performed by all of the OSS around the globe, the number of variants is incalculable… which probably means the zero-touch problem is unsolvable (we might be able to solve for many situations, but not all).

The more solvable approach would be to develop a more homogeneous approach to network self-care / self-optimisation. In other words, the majority of the zero-touch challenge is actually handled at the equivalent of EMS level and below (I’m perhaps using out-dated self-healing terminology, but hopefully terminology that’s familiar to readers) and only cross-domain issues bubble up to OSS level.

As the diagram below describes, each layer up abstracts but connects (as described in more detail in “What an OSS shouldn’t do“). That is, each higher layer in the stack reduces the amount of information / control within the domains it’s responsible for, but assumes a broader responsibility for connecting domains together.
OSS abstract and connect

The abstraction process reduces the number of self-healing variants the OSS needs to handle. But to cope with the complexity of self-caring for connected domains, we need a more homogeneous set of health information being presented up from the network.

Whereas the intent model is designed to push actions down into the lower layers with a standardised, simplified language, this would be the reverse – pushing network health knowledge up to higher layers to deal with… in a standard, consistent approach.
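To illustrate the idea (and only to illustrate – the field names below are invented, not drawn from any standards body), here’s a sketch of the kind of homogeneous health summary a domain might push up to the cross-domain layer:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DomainHealthSummary:
    """Invented (non-standard) example of the homogeneous health info a domain
    might push up to the cross-domain OSS layer after local self-care has run."""
    domain: str                      # eg "optical-transport"
    health_score: float              # 0.0 (down) .. 1.0 (healthy), after local self-healing
    unresolved_within_domain: int    # issues the domain could NOT self-heal
    impacted_services: list          # only the abstracted, cross-domain view

summary = DomainHealthSummary(
    domain="optical-transport",
    health_score=0.92,
    unresolved_within_domain=1,
    impacted_services=["wavelength-4021"],
)
print(json.dumps(asdict(summary), indent=2))
```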

And BTW, unlike the pervading approach of today, I’m clearly saying that when unforeseen (or not previously experienced) scenarios appear within a domain, they’re not just kicked up to OSS, but the domains are stoic enough to deal with the situation inside the domain.

Using OSS/BSS to steer the ship

For network operators, our OSS and BSS touch most parts of the business. The network, and the services it carries, are core business, so a majority of business units will be contributing to that core business. As such, our OSS and BSS provide many of the metrics used by those business units.

This is a privileged position to be in. We get to see what indicators are most important to the business, as well as the levers used to control those indicators. From this privileged position, we also get to see the aggregated impact of all these KPIs.

In your years of working on OSS / BSS, how many times have you seen key business indicators that conflict between business units? They generally become more apparent on cross-team projects, where the objectives of one internal team directly conflict with those of another.

In theory, a KPI tree can be used to improve consistency and ensure all business units are pulling towards a common objective… [but what if, like most organisations, there are many objectives? Does that mean you have a KPI forest and the trees end up fighting for light?]

But here’s a thought… Have you ever seen an OSS/BSS suite with the ability to easily build KPI trees? I haven’t. I’ve seen thousands of standalone reports containing myriad indicators, but never a consolidated roll-up of metrics. I have seen a few products that show operational metrics rolled-up into a single dashboard, but not business metrics. They appear to have been designed to show an information hierarchy, but not necessarily with KPI trees in mind specifically.
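For what it’s worth, the core mechanics aren’t complicated. Here’s a toy sketch of a KPI tree with a weighted roll-up (the KPIs, weights and scores are all invented):

```python
# Minimal KPI-tree sketch with invented KPIs, weights and scores.
kpi_tree = {
    "name": "Customer experience", "weight": 1.0, "children": [
        {"name": "Mean time to restore",  "weight": 0.4, "score": 0.7, "children": []},
        {"name": "Order cycle time",      "weight": 0.3, "score": 0.9, "children": []},
        {"name": "First-call resolution", "weight": 0.3, "score": 0.6, "children": []},
    ],
}

def rollup(node: dict) -> float:
    """Roll child scores up the tree as a weighted average."""
    if not node["children"]:
        return node["score"]
    total_weight = sum(c["weight"] for c in node["children"])
    return sum(c["weight"] * rollup(c) for c in node["children"]) / total_weight

print(f"{kpi_tree['name']} score: {rollup(kpi_tree):.2f}")
```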

What do you think? Does it make sense for us to offer KPI trees as base product functionality from our reporting modules? Would this functionality help our OSS/BSS add more value back into the businesses we support?

Are today’s platform benefits tomorrow’s constraints?

One of the challenges facing OSS / BSS product designers is which platform/s to tie the roadmap to.

Let’s use a couple of examples.
In the past, most outside plant (OSP) designs were done in AutoCAD, so it made sense to build OSP design tools around AutoCAD. However in making that choice, the OSP developer becomes locked into whatever product directions AutoCAD chooses in future. And if AutoCAD falls out of favour as an OSP design format, the earlier decision to align with it becomes an even bigger challenge to work around.

A newer example is for supply chain related tools to leverage Salesforce. The tools and marketplace benefits make great sense as a platform to leverage today. But it also represents a roadmap lock-in that developers hope will remain beneficial in the years / decades to come.

I haven’t seen an OSS / BSS product built around Facebook, but there are plenty of other tools that have been. Given Facebook’s recent travails, I wonder how many product developers are feeling at risk due to their reliance on the Facebook platform?

The same concept extends to the other platforms we choose – operating systems, programming languages, messaging models, data storage models, continuous development tools, etc, etc.

Anecdotally, it seems that new platforms are coming online at an ever greater rate, so the chance / risk of platform churn is surely much higher than when AutoCAD was chosen back in the 1980s – 1990s.

The question becomes how to leverage today’s platform benefits whilst minimising dependency and allowing easy swap-out. There’s too much nuance to answer this definitively, but one key strategy is to build key logic models outside the dependent platform (ie ensuring logic isn’t built into AutoCAD or similar).
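As a simple illustration of that strategy (the class and method names are mine, purely for the sketch), the design logic talks to an abstract canvas and the platform of the day is just a swappable adapter behind it:

```python
from abc import ABC, abstractmethod

class DesignCanvas(ABC):
    """Abstract boundary: OSP design logic talks to this, never to a vendor API directly."""
    @abstractmethod
    def draw_cable(self, route: list) -> None: ...

class AutoCadCanvas(DesignCanvas):
    def draw_cable(self, route: list) -> None:
        print(f"[AutoCAD adapter] drawing cable along {route}")   # vendor API call would go here

class WebGisCanvas(DesignCanvas):
    def draw_cable(self, route: list) -> None:
        print(f"[Web GIS adapter] drawing cable along {route}")

def design_lead_in(canvas: DesignCanvas) -> None:
    # The key design logic lives here, independent of the platform underneath.
    canvas.draw_cable(["pit-12", "pit-13", "premises-7"])

design_lead_in(AutoCadCanvas())   # swap in WebGisCanvas() with no change to the logic
```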

Designing OSS to cope with greater transience

“There are three broad models of networking in use today. The first is the adaptive model where devices exchange peer information to discover routes and destinations. This is how IP networks, including the Internet, work. The second is the static model where destinations and pathways (routes) are explicitly defined in a tabular way, and the final is the central model where destinations and routes are centrally controlled but dynamically set based on policies and conditions.”
Tom Nolle here.

OSS of decades past worked best with static networks. Services / circuits that were predominantly “nailed up” and (relatively) rarely changed after activation. This took the real-time aspect out of play and justified the significant manual effort required to establish a new service / circuit.

However, adaptive and centrally managed networks have come to dominate the landscape now. In fact, I’m currently working on an assignment where DWDM, a technology that was once largely static, is now being augmented to introduce an SDN controller and dynamic routing (at optical level no less!).

This paradigm shift changes the fundamentals of how OSS operate. Apart from the physical layer, network connectivity is now far more transient, so our tools must be able to cope with that. Not only that, but the changes are too frequent to justify the manual effort of the past.

To tie in with yesterday’s post, we are again faced with the option of abstract / generic modelling or specific modelling.

Put another way, we either have to come up with adaptive / algorithmic mechanisms to deal with that transience (the specific model), or mimic “nailed-up” concepts (the abstract model).

More on the implications of this tomorrow.

Which OSS tool model do you prefer – Abstract or Specific?

There’s something I’ve noticed about OSS products – they are either designed to be abstract / flexible or they are designed to cater for specific technologies / topologies.

When designed from the abstract perspective, the tools are built around generic core data models. For example, whether a virtual / logical device, a physical device, a customer premises device, a core network device, an access network device, a transmission device, a security device, etc, they would all be lumped into a single object called a device (but with flexibility to cater for the many differences).
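A toy sketch of that abstract approach (field names are illustrative only) – one generic device object with a flexible attribute bag, rather than a class per technology:

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    """Single generic object covering physical, logical, access, core, CPE, etc."""
    name: str
    role: str                      # eg "core-router", "cpe", "virtual-firewall"
    is_virtual: bool = False
    attributes: dict = field(default_factory=dict)   # technology-specific extras live here

devices = [
    Device("bng-01", "core-router", attributes={"chassis": "X16", "ports": 512}),
    Device("vfw-33", "virtual-firewall", is_virtual=True, attributes={"vcpu": 8}),
    Device("ont-1234", "cpe", attributes={"pon-port": "olt-07/1/3"}),
]
print([d.name for d in devices if d.is_virtual])
```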

When designed from the specific perspective, the tools are built with exact network and/or service models in mind. That could be 5G, DWDM, physical plant, etc and when any new technology / topology comes along, a new plug-in has to be built to augment the tools.

The big advantage of the abstract approach is obvious (if they truly do have a flexible core design) – you don’t have to do any product upgrades to support a new technology / topology. You just configure the existing system to mimic the new technology / topology.

The first OSS product I worked with had a brilliant core design and was able to cope with any technology / topology that we were faced with. It often just required some lateral thinking to make the new stuff fit into the abstract data objects.

What I also noticed was that operators always wanted to customise the solution so that it became more specific. They were effectively trying to steer the product from an open / abstract / flexible tool-set to the other end of the scale. They generally paid significant amounts to achieve specificity – to mould the product to exactly what their needs were at that precise moment in time.

However, even as an OSS newbie, I found that to be really short-term thinking. A network (and the services it carries) is constantly evolving. Equipment goes end-of-life / end-of-support every few years. Useful life models are usually approx 5-7 years and capital refresh projects tend to be ongoing. Then, of course, vendors are constantly coming up with new products, features, topologies, practices, etc.

Given this constant change, I’d much rather have a set of tools that are abstract and flexible rather than specific but less malleable. More importantly, I’ll always try to ensure that any customisations should still retain the integrity of the abstract and flexible core rather than steering the product towards specificity.

How about you? What are your thoughts?

An OSS conundrum with many perspectives

“Even aside from the OSS impact, it illustrates the contrast between “bottom-up” planning of networks (new card X is cheaper/has more ports) and “top down” (what do we need to change to reduce our costs/increase capacity).”
Robert Curran.

Robert’s quote above is in response to a post called “Trickle-down impact planning.”

Robert makes a really interesting point. Adding a new card type is a relatively common event for a big network operator. It’s a relatively minor challenge for the networks team – a BAU (Business as Usual) activity in fact. But if you follow the breadcrumbs, the impact to other parts of the business can be quite significant.

Your position in your organisation possibly dictates your perspective on the alternative approaches Robert discusses above. Networks, IT, planning, operations, sales, marketing, projects/delivery, executive – all will have different impacts and a different field of view on the conundrum. This makes it an interesting problem to solve – which viewpoint is the “right” one to tackle the challenge from?

My “solutioning” background tends to align with the top down viewpoint, but today we’ll take a look at this from the perspective of how OSS can assist from either direction.

Bottom Up: In an ideal world, our OSS and associated processes would be able to identify a new card (or similar) and just ripple changes out without interface changes. The first OSS I worked on did this really well. However, it was a “single-vendor” solution so the ripples were self-contained (mostly). This is harder to control in the more typical “best-of-breed” OSS stacks of today. There are architectural mechanisms for controlling the ripples out but it’s still a significant challenge to solve. I’d love to hear from you if you’re aware of any vendors or techniques that do this really well.

Top Down: This is where things get interesting. Should top-down impact analysis even be the task of an OSS/BSS? Since it’s a common / BAU operational task, you could argue it is. If so, how do we create OSS tools* that help with organisational impact / change / options analysis and not just network impact analysis? How do we build the tools* (a rough sketch follows the list below) that can:

  1. Predict the rippling impacts
  2. Allow us to estimate the impact of each
  3. Present options (if relevant) and
  4. Provide a cost-benefit comparison to determine whether any of the options are viable for development

* When I say “tools,” this might be a product, but it could just mean a process, data extract, etc.
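Here’s the rough sketch promised above, covering points 1 and 2 – walking a (completely invented) dependency graph from the changed element to the systems it ripples into, and summing an indicative effort estimate. Points 3 and 4 would build on the same data:

```python
# Invented dependency graph: which systems are touched when a component changes.
DEPENDS_ON = {
    "new card type":      ["physical inventory"],
    "physical inventory": ["logical inventory", "fault mgmt", "performance mgmt"],
    "logical inventory":  ["CMDB", "key reports"],
}
EFFORT_DAYS = {"physical inventory": 5, "logical inventory": 8, "fault mgmt": 10,
               "performance mgmt": 10, "CMDB": 4, "key reports": 3}

def ripple(change: str) -> set:
    """Point 1: predict which systems the change ripples out to."""
    impacted, frontier = set(), [change]
    while frontier:
        node = frontier.pop()
        for dep in DEPENDS_ON.get(node, []):
            if dep not in impacted:
                impacted.add(dep)
                frontier.append(dep)
    return impacted

impacted = ripple("new card type")
estimate = sum(EFFORT_DAYS.get(s, 0) for s in impacted)   # point 2: rough impact estimate
print(f"impacted systems: {sorted(impacted)}; indicative effort: {estimate} days")
```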

I have the sense that this type of functionality falls into the category of, “just because you can, doesn’t mean you should… build it into your OSS.” Have you seen an OSS/BSS with this type of impact analysis functionality built-in?

Blown away by one innovation. Now to extend on it

Our most recent two posts, from yesterday and Friday, have talked about one stunningly simple idea that helps to overcome one of OSS’ biggest challenges – data quality. Those posts have stimulated quite a bit of dialogue and it seems there is some consensus about the cleverness of the idea.

I don’t know if the idea will change the OSS landscape (hopefully), or just continue to be a strong selling point for CROSS Network Intelligence, but it has prompted me to think a little longer about innovating around OSS’ biggest challenges.

Our standard approach of just adding more coats of process around our problems, or building up layers of incremental improvements isn’t going to solve them any time soon (as indicated in our OSS Call for Innovation). So how?

Firstly, we have to be able to articulate the problems! If we know what they are, perhaps we can then take inspiration from the CROSS innovation to spur us into new ways of thinking?

Our biggest problem is complexity. That has infiltrated almost every aspect of our OSS. There are so many posts about identifying and resolving complexity here on PAOSS that we might skip over that one in this post.

I decided to go back to a very old post that used the Toyota 5-whys approach to identify the real cause of the problems we face in OSS [I probably should update that analysis because I have a whole bunch of additional ideas now, as I’m sure you do too… suggested improvements welcomed BTW].

What do you notice about the root-causes in that 5-whys analysis? Most of the biggest causes aren’t related to system design at all (although there are plenty of problems to fix in that space too!). CROSS has tackled the data quality root-cause, but almost all of the others are human-centric factors – change controls, availability of skilled resources, requirement / objective mis-matches, stakeholder management, etc. Yet, we always seem to see OSS as a technical problem.

How do you fix those people challenges? Ken Segall puts it this way, “When process is king, ideas will never be. It takes only common sense to recognize that the more layers you add to a process, the more watered down the final work will become.” Easier said than done, but a worthy objective!

Blown away by one innovation – a follow-up concept

Last Friday’s blog discussed how I’ve just been blown away by the most elegant OSS innovation I’ve seen in decades.

You can read more detail via the link, but the three major factors in this simple, elegant solution to data quality problems (probably OSS’ biggest kryptonite) are:

  1. Being able to make connections that break standard object hierarchy rules; but
  2. Having the ability to mark that standard rules haven’t been followed; and
  3. Being able to use the markers to prioritise the fixing of data at a more convenient time

It’s effectively point 2 that has me most excited. So novel, yet so obvious in hindsight. When doing data migrations in the past, I’ve used confidence flags to indicate what I can rely on and what needs further audit / remediation / cleansing. But the recent demo I saw of the CROSS product is the first time I’ve seen it built into the user interface of an OSS.

This one factor, if it spreads, has the ability to change OSS data quality in the same way that Likes (or equivalent) have changed social media by acting as markers of confidence / quality.

Think about this for a moment – what if everyone who interacts with an OSS GUI had the ability to rank their confidence in any element of data they’re touching, with a mechanism as simple as clicking a like/dislike button (or similar)?

Bad example here but let’s say field techs are given a design pack, and upon arriving at site, find that the design doesn’t match in-situ conditions (eg the fibre pairs they’re expecting to splice a customer lead-in cable to are already carrying live traffic, which they diagnose is due to data problems in an upstream distribution joint). Rather than jeopardising the customer activation window by having to spend hours/days fixing all the trickle-down effects of the distribution joint data, they just mark confidence levels in the vicinity and get the customer connected.

The aggregate of that confidence information is then used to show data quality heat maps and help remediation teams prioritise the areas that they need to work on next. It helps to identify data and process improvements using big circle and/or little circle remediation techniques.
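A toy sketch of that aggregation step (object IDs and votes are made up) – rolling simple up/down confidence votes into a prioritised remediation list, which is really just a text-only stand-in for the heat map:

```python
from collections import defaultdict

# Made-up confidence votes: (object id, +1 = looks right, -1 = looks wrong).
votes = [
    ("joint-DJ-4512", -1), ("joint-DJ-4512", -1), ("joint-DJ-4512", +1),
    ("cable-F-0071", +1), ("cable-F-0071", +1),
    ("pit-88213", -1),
]

score = defaultdict(lambda: {"up": 0, "down": 0})
for obj, vote in votes:
    score[obj]["up" if vote > 0 else "down"] += 1

# Lowest-confidence objects first: the text equivalent of a data quality heat map.
ranked = sorted(score.items(), key=lambda kv: kv[1]["down"] - kv[1]["up"], reverse=True)
for obj, s in ranked:
    print(f"{obj:<15} up:{s['up']}  down:{s['down']}")
```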

Possibly the most important implication of the in-built ranking system is that everyone in the end-to-end flow, from order takers to designers through to coal-face operators, can better predict whether they need to cater for potential data problems.

Your thoughts? In what scenarios do you think it could work best, or alternatively, not work?

I’ve just been blown away by the most elegant OSS innovation I’ve seen in decades

Looking back, I now consider myself extremely lucky to have worked with an amazing product on the first OSS project I worked on (all the way back in 2000). And I say amazing because the underlying data models and core product architecture are still better than any other I’ve worked with in the two decades since. The core is the most elegant, simple and powerful I’ve seen to date. Most importantly, the models were designed to cope with any technology, product or service variant that could be modelled as a hierarchy, whether physical or virtual / logical. I never found a technology that couldn’t be modelled into the core product and it required no special overlays to implement a new network model. Sadly, the company no longer exists and the product is languishing on the books of the company that bought out the assets but isn’t leveraging them.

Having been so spoilt on the first assignment, I’ve been slightly underwhelmed by the level of elegant innovation I’ve observed in OSS since. That’s possibly part of the reason for the OSS Call for Innovation published late last year. There have been many exciting innovations introduced since, but many modern tools are still more complex and complicated than they should be, for implementers and operators alike.

But during a product demo last week, I was blown away by an innovation that was so simple in concept, yet so powerful that it is probably the single most impressive innovation I’ve seen since that first OSS. Like any new elegant solution, it left me wondering why it hasn’t been thought of previously. You’re probably wondering what it is. Well first let me start by explaining the problem that it seeks to overcome.

Many inventory-based OSS rely on highly structured and hierarchical data. This is a double-edged sword. Significant inter-relationship of data increases the insight generation opportunities, but the downside is that it can be immensely challenging to get the data right (and to maintain a high-quality data state). Limited data inter-relationships make the project easier to implement, but tend to allow less rich data analyses. In particular, connectivity data (eg circuits, cables, bearers, VPNs, etc) can be a massive challenge because it requires the linking of separate silos of data, often with no linking key. In fact, the data quality problem was probably one of the most significant root-causes of the demise of my first OSS client.

Now getting back to the present. The product feature that blew me away was the first I’ve seen that allows significant inter-relationship of data (yet in a simple data model), but still copes with poor data quality. Let’s say your OSS has a hierarchical data model that comprises Location, Rack, Equipment, Card, Port (or similar) and you have to make a connection from one device’s port to another’s. In most cases, you have to build up the whole pyramid of data perfectly for each device before you can create a customer connection between them. Let’s also say that for one device you have a full pyramid of perfect data, but for the other end, you only know the location.

The simple feature is to connect a port to a location now, or any other point to point on the hierarchy (and clean up the far-end data later on if you wish). It also allows the intermediate hops on the route to be connected at any point in the hierarchy. That sounds too simple, right? Yet most inventory tools don’t allow connections to be made between different levels of their hierarchies. For implementers, data migration / creation / cleansing gets a whole lot simpler with this approach. But what’s even more impressive is that the solution then assigns a data quality ranking to the data that’s just been created. The quality ranking is subsequently considered by tools such as circuit design / routing, impact analysis, etc. However, you’ll have noted that the data quality issue still hasn’t been fixed. That’s correct, so this product then provides the tools that show where quality rankings are lower, thus allowing remediation activities to be prioritised.
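To make the concept tangible (and to be clear, this is my own toy illustration, not the CROSS data model), here’s a sketch of allowing a connection between any two levels of a hierarchy and deriving a quality rank from how specific each endpoint is:

```python
# Toy illustration only - not the CROSS product's actual data model.
HIERARCHY_DEPTH = {"location": 1, "rack": 2, "equipment": 3, "card": 4, "port": 5}

def connect(a_level: str, a_id: str, b_level: str, b_id: str) -> dict:
    """Create a connection between any two hierarchy levels and score its data quality.
    A port-to-port connection scores 1.0; a port-to-location connection scores lower."""
    quality = (HIERARCHY_DEPTH[a_level] + HIERARCHY_DEPTH[b_level]) / (2 * HIERARCHY_DEPTH["port"])
    return {"a": (a_level, a_id), "b": (b_level, b_id), "quality": round(quality, 2)}

good = connect("port", "MEL-R3-EQ7-C2-P14", "port", "SYD-R1-EQ2-C1-P03")
rough = connect("port", "MEL-R3-EQ7-C2-P14", "location", "SYD-EXCH-01")   # clean up far end later
print(good["quality"], rough["quality"])   # remediation tools would prioritise the lower score
```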

If you have an inventory data quality challenge and / or are wondering the name of this product, it’s CROSS, from the team at CROSS Network Intelligence (www.cross-ni.com).

Is your data AI-ready (part 2)

Further to yesterday’s post that posed the question about whether your data was AI ready for virtualised network assurance use cases, I thought I’d raise a few more notes.

The two reasons posed were:

  1. Our data sets haven’t had time to collect much elastic / dynamic network data yet
  2. Our data is riddled with human-generated data that is error-prone

On the latter point in particular, I sense that we’re going to have to completely re-architect the way we collect and store assurance data. We’re almost certainly going to have to think in terms of automated assurance actions and related logging to avoid the errors of human data creation / logging. The question becomes whether it’s worthwhile trying to wrangle all of our old data into formats that the AI engines can cope with, or whether we just start afresh with new models? (This brings to mind the recent “perfect data” discussion).
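By way of illustration (the field names are invented), machine-written assurance logging might look something like this – consistent fields by construction, rather than free-text notes from a hurried operator:

```python
import json, time, uuid

def log_assurance_action(trigger_alarm_id: str, action: str, parameters: dict, outcome: str) -> str:
    """Write a structured, machine-generated record of an automated assurance action.
    Consistent fields by construction - no free-text 'fixed it, moving on' entries."""
    record = {
        "record_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trigger_alarm_id": trigger_alarm_id,
        "action": action,                # eg "restart_vnf", "reroute_service"
        "parameters": parameters,
        "outcome": outcome,              # eg "alarm_cleared", "escalated"
    }
    return json.dumps(record)

print(log_assurance_action("ALM-20231", "restart_vnf", {"vnf": "vFW-12"}, "alarm_cleared"))
```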

It will be one thing to identify patterns, but another thing entirely to identify optimum response activities and to automate those.

If we get these steps right, does it become logical that the NOC (network) and SOC (security operations centre) become conjoined… at least much more so than they tend to be today? In other words, does incident management merge network incidents and security incidents onto common analysis and response platforms? If so, does that imply another complete re-architecture? It certainly changes the operations model.

I’d love to hear your thoughts and predictions.

Are your existing data sets actually suited to seeding an AI engine?

“In the virtualization domain, the old root cause technology is becoming obsolete because resources and workloads move around dynamically – we no longer have fixed network and compute resources. Existing service assurance systems in the telecommunication network were designed to manage a fixed set of resources and these assurance systems fall short in monitoring dynamic virtualized networks. Code was written using a rule based approach on known problems. Some advances have been made to develop signature patterns to determine the root cause of a problem, but this approach will also fall short in a dynamic virtualized network where autonomous changes will occur continuously.”
Patrick Kelly here.

This quote is taken from a really interesting article by Patrick Kelly (see link above).

The old models of determining service impact and root-cause certainly struggle to hold up in the transient world of virtualised networks. Artificial Intelligence or Machine Learning or machine-led pattern identification, or whatever the technologies will be called by their developers, have a really important part to play in networks that are not just dynamic, but undergoing a touchpoint explosion.

The fascinating part of this story is that these clever new models will rely on data. Lots of data. We already have lots of data to feed into the new models. Buuuuut…. I’ve long held the reservation that there might be one slight problem… does all of our existing data actually suit the “AI” models available today?

Firstly, our existing data doesn’t include much of a history on dynamically transient networks. But the more important factor is that our networks have been managed by humans – operators who have a tendency of recording the quickest, dirtiest (and not necessarily correct or complete) set of data that allows them to restore service quickly.

Following a recent discussion with someone who’s running an AI assurance PoC for a big telco, it seems this reservation is turning out to be true. Their existing data sets just aren’t suited to the AI models. They’re having to reconsider their whole approach to their data model and how to collect / store it. They’re now starting to get positive results from the custom-built data sets.

It’s coming back to the same story as a post from last week – having connectors that can translate the different languages of ops, data, AI, etc and building a people / process / technology solution that the AI models can cope with.

You might not be ready to start an AI experiment yet, but you may like to start the journey by understanding whether your existing data is suited to AI modelling. If not, you get the chance to change it and have a great repository of data to seed an AI engine when you are ready in future. The first step on an exponential OSS journey.

Trickle-down impact planning

We introduced the concept of The Trickle-down Effect last year, an effect that sees the most minor changes trickling down through an OSS stack, with much bigger consequences than expected.

“The trickle-down effect can be insidious, turning a nice open COTS solution into a beast that needs constant attention to cope with the most minor of operational changes. The more customisations made, the more gnarly the beast tends to be.”

Here’s an example I saw recently. An internal business unit wanted to introduce a new card type into the chassis set they managed. Speaking with the physical inventory team, it seemed the change was quite small and a budget was developed for the works… but the budget (dollars / time / risk) was about to blow out in a big way.

The new card wasn’t being picked up in their fault-management or performance management engines. It wasn’t picked up in key reports, nor was it being identified in the configuration management database or logical inventory. Every one of these systems needed interface changes. Not massive changes individually, obviously, but collectively the budget blew out by 10x, and the expedited changes pushed out work previously planned by each of the interface development and testing teams.

These trickle-down impacts were known…. by some people…. but weren’t communicated to the business unit responsible for managing the new card type. There’s a possibility that they may not have even added the new card type if they realised the full OSS cost consequences.

Are these trickle-down impacts known and readily communicated within your OSS change processes?

Those who rule perfect data…

A Passionate About OSS article last month spoke of how the investment strategy of a $106 billion VC fund has changed my thinking on our OSS’ most valuable asset. Masayoshi Son is quoted in that article as follows:

“Those who rule data will rule the entire world. That’s what people of the future will say.”

But one question keeps coming back to me… if you’re ruling poor quality data, will you rule nothing whatsoever?

Along the same lines, the old adage, “practice makes perfect,” is not very helpful if you’re not practicing in a constructive way. A better (albeit somewhat impossible) variant on the adage would be “PERFECT practice makes perfect.”

Let me share an example. There is a product that is completely ground-breaking in its ability to automate and optimise designs of large-scale network roll-outs – designs that include outside plant and access network technologies. In bake-offs with some of the best available network designers, this product and its algorithm consistently beats the humans by far more than 25% (when measured by capital costs, implementation time and various other metrics).

Its one challenge in taking over the world and automating every future network design is having a base set of data that is so perfect that no re-design work is required. For example, if the base data says a duct route is available and has capacity for inserting a cable, then the product assumes it can use the duct in its optimal design. But when the field techs arrive at site, they find the duct is too badly damaged to use or already filled to capacity with other cables that can’t be overhauled. A new optimal design has to be calculated to consider the lack of availability of that duct.

The tool still gives great results, even after all the manual intervention, but perfect source data would give breathtaking results.

So I’d look to make one small tweak to Masayoshi Son’s quote. “Those who rule PERFECT* data will rule the entire world. That’s what people of the future will say.”

* whereby perfect means as high in quality as realistically possible.

So, perhaps those expensive data audits and cumbersome data quality processes will have a far greater ROI (Return on Investment) in future than any of us could ever estimate.

The pruning saw technique for OSS fall-out management

Many different user journeys flow through our OSS every day. These include external / customer journeys, internal / operator journeys and possibly even machine-to-machine or system journeys. Unfortunately, not all of these journeys are correctly completed through to resolution.

The incomplete or unsatisfactory journeys could include inter-system fall-outs, customer complaints, service quality issues, and many more.

If we categorise and graph these unsuccessful journeys, we will often find a graph like the one below. This example shows that a small number of journey categories account for a large proportion of the problematic journeys. However, it also shows a long tail of other categories that are individually insignificant, but collectively significant.

Long tail of OSS demands

The place where everybody starts (including me) is to focus on the big wins on the left side of the graph. It makes sense that this is where your efforts will have the biggest impacts. Unfortunately, since that’s where everybody focuses effort, chances are the significant gains have already been achieved and optimisation is already quite high (but numbers are still high due to constraints within the existing solution stack).

The long tail intrigues me. It’s harder to build a business case for because there are many causes and little return for solving them. Alternatively, if we can slowly and systematically remove many of these rarer events, we’re removing the noise and narrowing our focus on the signal, thus simplifying our overall solution.
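Here’s a small sketch of the category analysis behind a graph like that (the categories and counts are fabricated) – ranking fall-out categories and showing how much sits in the long tail beyond the top few:

```python
# Fabricated fall-out counts per journey category, purely for illustration.
fallouts = {"address mismatch": 420, "port reservation clash": 310, "billing sync error": 95,
            "grandfathered product X": 12, "manual diversity check": 9, "legacy EMS timeout": 7,
            "obscure bundling option": 5, "misc": 22}

ranked = sorted(fallouts.items(), key=lambda kv: kv[1], reverse=True)
total = sum(fallouts.values())
head = sum(count for _, count in ranked[:2])   # the "big wins" at the left of the graph

print(f"top 2 categories: {head / total:.0%} of fall-outs")
print(f"long tail ({len(ranked) - 2} categories): {(total - head) / total:.0%} of fall-outs")
for name, count in ranked:
    print(f"  {name:<25} {count}")
```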

Since they’re statistically rarer, we can often afford to be more ruthless in the ways that we prevent them from occurring. But how do we identify and prevent them?

Each of the bars on the chart above represent leaves on a decision tree (faulty leaves in this case). If we work our way back up the branches, we can ruthlessly prune. Examples could be:

  • The removal of obscure service options that lead to faults
  • Reduction (or expansion or refinement) of process designs to better cope with boundary cases that generate problems
  • Removal of grandfathered products that have few customers, generate losses and consume disproportionate support effort
  • Removal or refinement of interfaces, especially east-west systems that contribute little to an end-to-end process but lead to many fall-outs
  • Identification of improved exception handling, either within systems or at hand-off points

I’m sure you can think of many more, especially when you start tracing back up decision trees with a pruning saw in hand!!

 

Interaction points with fast/slow processes

Further to yesterday’s post on fast / slow processes and factory platforms, a concept presented by Sylvain Denis of Orange in Melbourne last week, here’s a diagram from Sylvain’s presentation pack:

The yellow blocks represent the fast (automated) processes. The orange blocks represent the slow processes.

The next slide showed the human interaction points (blue boxes) into this API / factory stack.