OSS holds the key to network slicing

“Network slicing opens new business opportunities for operators by enabling them to provide specialized services that deliver specific performance parameters. Guaranteeing stringent KPIs enables operators to charge premium rates to customers that value such performance. The flip side is that such agreements will inevitably come with tough contractual obligations and penalties when the agreed KPIs are not met…even high numbers of slices could be managed without needing to increase the number of operational staff. The more automation applied, the lower the operating costs. At 100 percent automation, there is virtually no cost increase with the number of slices. Granted this is a long-term goal and impractical in the short to medium term, yet even 50 percent automation will bring very significant benefits.”
From a paper by Nokia – “Unleashing the economic potential of network slicing.”

With typical communications services tending towards commoditisation, operators will naturally seek out premium customers: those with premium requirements for latency, throughput, reliability, mobility, geography, security, analytics, etc.

These custom requirements often come with unique network configuration requirements. This is why network slicing has become an attractive proposition. The white paper quoted above makes an attempt at estimating profitability of network slicing including some sensitivity analyses. It makes for an interesting read.

The diagram below is one of many contained in the White Paper:
Nokia Network Slicing

It indicates that a significant level of automation is going to be required to achieve an equivalent level of operational cost to a single network. To re-state the quote, “The more automation applied, the lower the operating costs. At 100 percent automation, there is virtually no cost increase with the number of slices. Granted this is a long-term goal and impractical in the short to medium term, yet even 50 percent automation will bring very significant benefits.”
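As a rough illustration of the quoted relationship, the sketch below uses a toy linear model. This is purely an assumption for illustration, not the model used in the Nokia paper: each extra slice adds a fixed manual operating cost, and automation removes a proportional share of it. The function name `slice_opex` and the cost figures are hypothetical.

```python
def slice_opex(num_slices: int, automation: float,
               base_cost: float = 100.0,
               manual_cost_per_slice: float = 10.0) -> float:
    """Toy model of operating cost for a network carrying num_slices slices.

    automation is the fraction (0.0 to 1.0) of per-slice operational work
    that is automated. Cost units are arbitrary and hypothetical.
    """
    if not 0.0 <= automation <= 1.0:
        raise ValueError("automation must be between 0 and 1")
    return base_cost + num_slices * manual_cost_per_slice * (1.0 - automation)


for automation in (0.0, 0.5, 1.0):
    costs = [slice_opex(n, automation) for n in (1, 10, 100)]
    print(f"automation={automation:.0%}: cost for 1/10/100 slices = {costs}")
```

In this simplified model the cost curve is flat at full automation, and halving the manual effort halves the slope, which is the shape of the argument made in the quote.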

Even 50% operational automation is a significant ambition. OSS hold the key to delivering on this ambition. Such ambitious automation goals mean we have to look at massive simplification of operational variant trees, simplifications that include, but go far beyond, OSS, BSS and networks. This implies whole-stack simplification.

OSS designed as a bundle, or bundled after?

Over the years I’m sure you’ve seen many different OSS demonstrations. You’ve probably also seen presentations by vendors / integrators that have shown multiple different products from their suite.

How integrated have they appeared to you?

  1. Have they seemed tightly integrated, as if carved from a single piece of stone?
  2. Or have they seemed loosely integrated, a series of obviously different stones joined together with some mortar?
  3. Or perhaps even barely associated, a series of completely different objects (possibly through product acquisition) branded under a common marketing name?

There are different pros and cons with each approach. Tight integration possibly suits a greenfields OSS. Looser integration perhaps better suits carve-off for best-of-breed customer architecture models.

I don’t know about you, but I always prefer to be given the impression that an attempt has been made to ensure consistency in the bundling. Consistency of user-interface, workflow, data modelling/presentation, reports, etc. With modern presentation layers, database technologies and the availability of UX / CX expertise, this should be less of a hurdle than it has been in the past.

Very little OSS data is ever actually used

We keep shiploads of data in our OSS, don’t we? Just think about how much storage your OSS estate consumes.

Technically, it doesn’t cost much (relatively) to retain all that potential for insight generation with the cost of storage diminishing. The real cost of storing the data goes a little deeper than the $/MB though. Other cost factors include data curation, cleansing, database search performance, etc.

There’s a whole field of study relating to this, named Information Lifecycle Management (ILM), but let’s look at it in terms of relevance to OSS.

We collect information across different timescales including real-time processing, short-term correlations, longer-term trending and long-term statutory / regulatory.

Information Lifecycle Management
Note: I suspect the “Less Archive” box actually should say “Less Active”.
Diagram above sourced from here.

But rather than just blindly storing everything, we could ask ourselves at what stage each data sub-set loses relevance. As our OSS data ages, it can tend to deteriorate because the models it uses also deteriorate. Model deterioration factors, such as those described in this recent post about a machine-learning PoC and the following, are numerous:

  • Network devices change (including cards, naming conventions used, life-cycle upgrades, capacity, new alarm types, etc)
  • Network topologies change
  • Business processes change
  • Customer behaviours change
  • Product / Service offerings change
  • Regulations change
  • New datasets become available
  • Data model factors change to cope with gaps in original models

Each of these factors (and more) leads to deterioration in the usefulness of baseline data. This means the insight signals in the data become less clear, or at worst the baseline needs to be re-established, making old data invalid. If it’s invalid, then retention would appear to be pointless. Shifting it to the right through the storage types shown in the diagram above could also be pointless.

Very little of the OSS data you store is ever actually used, decreasingly so as it ages. Do you have a heatmap of what data you use in your OSS?
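If you don’t have such a heatmap, one simple way to start is to count data accesses by data set and by the age of the record at the time it was read. The sketch below assumes you can extract an access log of (data set, record creation time, access time) tuples from your OSS; that log format, the bucket boundaries and the function names are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical age buckets: (label, upper bound in days, exclusive)
AGE_BUCKETS = [("under 1 week", 7), ("under 1 month", 30), ("under 1 year", 365)]


def age_bucket(age_days: int) -> str:
    """Map a record's age in days onto a coarse bucket label."""
    for label, upper in AGE_BUCKETS:
        if age_days < upper:
            return label
    return "1 year or more"


def usage_heatmap(access_log):
    """Count accesses per (data set, age bucket).

    access_log: iterable of (dataset, record_created, accessed_at) tuples.
    """
    heat = Counter()
    for dataset, created, accessed in access_log:
        heat[(dataset, age_bucket((accessed - created).days))] += 1
    return heat


log = [
    ("alarms", datetime(2024, 1, 1), datetime(2024, 1, 2)),
    ("alarms", datetime(2024, 1, 1), datetime(2024, 1, 3)),
    ("alarms", datetime(2023, 1, 1), datetime(2024, 1, 10)),
    ("inventory", datetime(2024, 1, 1), datetime(2024, 1, 20)),
]
heat = usage_heatmap(log)
for (dataset, bucket), hits in sorted(heat.items()):
    print(f"{dataset:10} {bucket:15} {hits}")
```

Cells that stay at or near zero are candidates for archiving or deletion, subject to any statutory / regulatory retention rules.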

Where are the reliability hotspots in your OSS?

As you already know, there are two categories of downtime – unplanned (eg failures) and planned (eg upgrades / maintenance).

Planned downtime sounds a lot nicer (for operators) but the reality is that you could call both types “incidents” – they both impact (or potentially impact) the customer. We sometimes underestimate that fact.

Today’s question is whether you’re able to identify where the hotspots are in your OSS suite when you combine both types of downtime. Can you tell which outages are service-impacting?

In a round-about way, I’m asking whether you already have a dashboard that monitors uptime of all the components (eg applications, probes, middleware, infra, etc) that make up your complete OSS / BSS estate? If you do, does it tell you what you anecdotally know already, or are there sometimes surprises?

Does the data give you the evidence you need to negotiate with the implementers of problematic components (eg patch cadence, the need for reliability fixes, streamlining the patch process, reduction in customisations, etc)? Does it give you reason to make architectural changes (eg webscaling)?
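As a starting point for that kind of dashboard, the sketch below totals planned and unplanned downtime per component and ranks the results. The `Outage` record, its field names and the sample figures are assumptions for illustration; real data would come from your incident and change management tools.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    component: str           # eg application, probe, middleware
    kind: str                # "planned" or "unplanned"
    minutes: int
    service_impacting: bool


def hotspots(outages):
    """Rank components by combined planned + unplanned downtime (minutes)."""
    totals = defaultdict(lambda: {"planned": 0, "unplanned": 0, "impacting": 0})
    for o in outages:
        totals[o.component][o.kind] += o.minutes
        if o.service_impacting:
            totals[o.component]["impacting"] += o.minutes
    return sorted(
        totals.items(),
        key=lambda item: item[1]["planned"] + item[1]["unplanned"],
        reverse=True,
    )


outages = [
    Outage("mediation", "unplanned", 90, True),
    Outage("inventory", "planned", 240, False),
    Outage("mediation", "planned", 60, True),
    Outage("inventory", "unplanned", 30, True),
]
ranked = hotspots(outages)
for component, t in ranked:
    print(component, t)
```

In this made-up sample, the component with the fewest failures tops the list once planned downtime is included, which is the kind of surprise the combined view is meant to surface.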

OSS operationalisation at scale

We had a highly flexible network design team at a previous company. Not necessarily because we wanted to be, but because we were forced to be by the client’s allocation of work.

Our team was largely based on casual workers because there was little to predict whether we needed 2 designers or 50 in any given week. The workload being assigned by the client was incredibly lumpy.

But we were lucky. We only had design work. The lumpiness in design effort flowed down through the work stack into construction, test and deployment teams. The constructors had millions of dollars of equipment that they needed to mobilise and demobilise as the work ebbed and flowed. Unfortunately for the constructors, they’d prepared their rate cards on the assumption of a fairly consistent level of work coming through (it was a very big project).

This lumpiness didn’t work out for anyone in the delivery pipeline, the client included. It was actually quite instrumental in a few of the constructors going into liquidation. The client struggled to meet roll-out targets.

The allocation of work was being made via the client’s B/OSS stack. The B/OSS teams were blissfully unaware of the downstream impact of their sporadic allocation of designs. Towards the end of the project, they were starting to get more consistent and delivery teams started to get into more of a rhythm… just as the network was coming to the end of build.

As OSS builders, we sometimes get so wrapped up in delivering functionality that we can forget that one of the key requirements of an OSS is to operationalise at scale. In addition to UI / CX design, this might be something as simple as smoothing the effort allocation for work under our OSS’s management.
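One very simple form of smoothing is to release work downstream at a capped weekly rate and carry the remainder as backlog. The sketch below is an illustration only (it is not what the client’s B/OSS did); the function name and the weekly figures are hypothetical.

```python
def smooth_allocation(arrivals, weekly_capacity):
    """Release at most weekly_capacity jobs per week, carrying the rest over.

    arrivals: number of new jobs arriving each week.
    Returns (jobs released per week, backlog left at the end).
    """
    released, backlog = [], 0
    for arriving in arrivals:
        backlog += arriving
        out = min(backlog, weekly_capacity)
        released.append(out)
        backlog -= out
    return released, backlog


lumpy = [50, 2, 0, 40, 3, 1]          # designs arriving per week
released, backlog = smooth_allocation(lumpy, weekly_capacity=16)
print(released, backlog)
```

The trade-off is latency: some jobs wait in the backlog, but design, construction and test teams see a steady flow they can resource against.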

OSS data Ponzi scheme

The more data you have, the more data you need to understand the data you have. You are engaged in a data ponzi scheme…Could it be in service assurance and IT ops that more data equals less understanding?
Phil Tee
in the opening address at the AIOps Symposium.

Interesting viewpoint right?

Given that our OSS hold shed-loads of data, Phil is saying we need lots of data to understand that data. Well, yes… and possibly no.

I have a theory that data alone doesn’t talk, but it’s great at answering questions. You could say that you need lots of data, although I’d argue in semantics that you actually need lots of knowledge / awareness to ask great questions. Perhaps that knowledge / awareness comes from seeding machine-led analysis tools (or our data scientists’ brains) with lots of data.

The more data you have, the more noise you need to find signal amongst. That means you have to ask more questions of your data if you want to drive a return that justifies the cost of collecting and curating it all. Machine-led analytics certainly assist us in handling the volume and velocity of data our OSS create / collect. That’s just asking the same question/s over and over. There’s almost no end to the questions that can be asked of our data, just a limit on the time in which we can ask them.

Does that make data a Ponzi scheme? A Ponzi scheme pays profits to earlier investors using funds obtained from newer investors. Eventually it must collapse because the scheme runs out of new investors to fund profits. A data Ponzi scheme pays out insights from earlier (seed) data by obtaining new (streaming) data. The stream of data reaching an OSS never runs out. If we need to invest heavily in data (eg AI / ML, etc), at what point in the investment lifecycle will we stop creating new insights?

Help needed: IoT / OSS cross-over use cases

Hi PAOSS community.

I’d like to call in a favour today if I may. I’m on the hunt for any existing use-cases and / or project sites that have integrated a significant sensor network into their OSS and existing operational processes.

That includes a strategy for handling IoT-scale integration of data collection, event / alarm processing, device management, data contextualization, data analytics, end-to-end security and applications management / enablement within existing OSS tools.

I’m looking for examples where an OSS had previously managed thousands of (network) devices and is now managing hundreds of thousands of (IoT) devices. Not necessarily IoT devices of customers as services but within an operator’s own network.

Obviously that’s an unprecedented change in scale in traditional OSS terms, but will be commonplace if our OSS are to play a part in the management of large sensor networks in the future.

There’s an element of mutual exclusivity between what an IoT management platform and an OSS need to do, but there are also some similarities. I’d love to speak with anyone who has actually bridged the gap.

If your partners don’t have to talk to you then you win

“If your partners don’t have to talk to you then you win.”
Guy Lupo.

Put another way, the best form of customer service is no customer service (ie your customers and/or partners are so delighted with your automated offerings that they have no reason to contact you). They don’t want to contact you anyway (generally speaking). They just want to consume a perfectly functional and reliable solution.

In the deep, distant past, our comms networks required operators. But then we developed automated dialling / switching. In theory, the network looked after itself and people made billions of calls per year unassisted.

Something happened in the meantime though. Telco operators the world over started receiving lots of calls about their platform and products. You could say that they’re unwanted calls. The telcos even have an acronym called CVR – Call Volume Reduction – that describes their ambitions to reduce the number of customer calls that reach contact centre agents. Tools such as chatbots and IVR have sprung up to reduce the number of calls that an operator fields.

Network as a Service (NaaS), the context of Guy’s comment above, represents the next new tool that will aim to drive CVR (amongst a raft of other benefits). NaaS theoretically allows customers to interact with network operators via impersonal contracts (in the form of APIs). The challenge will be in the reliability – ensuring that nothing falls between the cracks in any of the layers / platforms that combine to form the NaaS.

In the world of NaaS creation, Guy is exactly right – “If your partners [and customers] don’t have to talk to you then you win.” As always, it’s complexity that leads to gaps. The more complex the NaaS stack, the less likely you are to achieve CVR.

What OSS environments do you need?

When we’re planning a new OSS, we tend to be focused on the production (PROD) environment. After all, that’s where its primary purpose is served, to operationalise a network asset. That is where the majority of an OSS’s value gets created.

But we also need some (roughly) equivalent environments for separate purposes. We’ll describe some of those environments below.

By default, vendors will tend to only offer licensing for a small number of database instances – usually just PROD and a development / test environment (DEV/TEST). You may not envisage that you will need more than this, but you might want to negotiate multiple / unlimited instances just in case. If nothing else, it’s worth bringing to the negotiation table even if it gets shot down because budgets are tight and / or vendor pricing is inflexible relating to extra environments.

Examples where multiple instances may be required include:

  1. Production (PROD) – as indicated above, that’s where the live network gets managed. User access and controls need to be tight here to prevent catastrophic events from happening to the OSS and/or network
  2. Disaster Recovery (DR) – depending on your high-availability (HA) model (eg cold standby, primary / redundant, active / active), you may require a DR or backup environment
  3. Sandpit (DEV / TEST) – these environments are essential for OSS operators to be able to prototype and learn freely without the risk of causing damage to production environments. There may need to be multiple versions of this environment depending on how reflective of PROD they need to be and how viable it is to take refresh / updates from PROD (aka PROD cuts). Sometimes also known as non-PROD (NP)
  4. Regression testing (REG TEST) – regression testing requires a baseline data set to continually test and compare against, flagging any variations / problems that have arisen from any change within the OSS or networks (eg new releases). This implies a need for data and applications to be shielded from the constant change occurring on other types of environments (eg DEV / TEST). In situations where testing transforms data (eg activation processes), REG TEST needs to have the ability to roll-back to the previous baseline state
  5. Training (TRAIN) – your training environments may need to be established with a repeatable set of training scenarios that also need to be re-set after each training session. This should also be separated from the constant change occurring on dev/test environments. However, due to a shortage of environments, and the relative rarity of training needed at some customers, TRAIN often ends up as another DEV or TEST environment
  6. Production Support (PROD-SUP) – this type of environment is used to prototype patches, releases or defect fixes (for defects on the PROD environment) prior to release into PROD. PROD-SUP might also be used for stress and volume testing, or SVT may require its own environment
  7. Data Migration (DATA MIG) – At times, data creation and loading need to be prototyped in a non-PROD environment. Sometimes this can be done in PROD-SUP or even a DEV / TEST environment. On other occasions it needs its own dedicated environment so as to not interrupt BAU (business as usual) activities on those other environments
  8. System Integration Testing (SIT) – OSS integrate with many other systems and often require dedicated integration testing environments
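To make the REG TEST requirement in the list above a little more concrete, the sketch below shows the two behaviours it calls for: comparing current data against a protected baseline, and rolling back after a test has transformed the data. It is a minimal in-memory illustration with hypothetical names; a real OSS would do this with database snapshots or restore points.

```python
import copy


class RegTestEnvironment:
    """Holds a protected baseline data set plus a working copy for test runs."""

    def __init__(self, baseline: dict):
        self._baseline = copy.deepcopy(baseline)   # never modified by tests
        self.data = copy.deepcopy(baseline)        # tests change this copy

    def diff_from_baseline(self) -> dict:
        """Return {key: (baseline value, current value)} for every difference."""
        keys = set(self._baseline) | set(self.data)
        return {
            k: (self._baseline.get(k), self.data.get(k))
            for k in keys
            if self._baseline.get(k) != self.data.get(k)
        }

    def roll_back(self) -> None:
        """Discard test changes and restore the baseline state."""
        self.data = copy.deepcopy(self._baseline)


env = RegTestEnvironment({"svc-1": "planned", "svc-2": "active"})
env.data["svc-1"] = "active"            # eg an activation test transforms data
print(env.diff_from_baseline())
env.roll_back()
print(env.diff_from_baseline())
```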

Am I forgetting any? What other environments do you find to be essential on your OSS?

Designing an Operational Domain Manager (ODM)

A couple of weeks ago, Telstra and the TM Forum held an event in Melbourne on OSS for next gen architectures.

The diagram below comes from a presentation by Corey Clinger. It describes Telstra’s Operational Domain Manager (ODM) model that is a key component of their Network as a Service (NaaS) framework. Notice the API stubs across the top of the ODM? Corey went on to describe the TM Forum Open API model that Telstra is building upon.
Operational Domain Manager (ODM)

In a following session, Raman Balla indicated a perspective that differs from many existing OSS. The service owner (and service consumer) must know all aspects of a given service (including all dimensions, lifecycle, etc) in a common repository / catalog and it needs to be attribute-based. Raman also indicated that the aim he has for architecting NaaS is to not only standardise the service, but the entire experience around the service.

In the world of NaaS, operators can no longer just focus separately on assurance or fulfillment or inventory / capacity, etc. As per DevOps, operators are accountable for everything.

Security and privacy as an OSS afterthought?

I often talk about OSS being an afterthought for network teams. I find that they’ll often design the network before thinking about how they’ll operationalise it with an OSS solution. That applies both to network products (eg developing a new device and only thinking about building the EMS later) and to building networks themselves.

It can be a bit frustrating because we feel we can give better solutions if we’re in the discussion from the outset. As OSS people, I’m sure you’ll back me up on this one. But we can’t go getting all high and mighty just yet. We might just be doing the same thing… but to security, privacy and analytics teams.

In terms of security, we’ll always consider security-based requirements (usually around application security, access management, etc) in our vendor / product selections. We’ll also include Data Communications Network (DCN) designs and security appliance (eg firewalls, IPS, etc) effort in our implementation plans. Maybe we’ll even prescribe security zone plans for our OSS. But security is more than that (check out this post for example). We often overlook the end-to-end aspects such as central authentication, API hardening, server / device patching, data sovereignty, etc, and they then get picked up by the relevant experts well into the project implementation.

Another one is privacy. Regulations like GDPR and the Facebook trials show us the growing importance of data privacy. I have to admit that historically, I’ve been guilty on this one, figuring that the more data sets I could stitch together, the greater the potential for unlocking amazing insights. Just one problem with that model – the more data sets that are stitched together, the more likely that privacy issues arise.

We increasingly have to figure out ways to weave security, privacy and analytics into our OSS planning up-front and not just think of them as overlays that can be developed after all of our key decisions have been made.

Zero touch network & Service Management (ZSM)

Zero touch network & Service Management (ZSM) is a next-gen network management approach using closed-loop principles hosted by ETSI. An ETSI blog has just demonstrated the first ZSM Proof of Concept (PoC). The slide deck describing the PoC, supplied by EnterpriseWeb, can be found here.

The diagram below shows a conceptual closed-loop assurance architecture used within the PoC.
ETSI ZSM PoC

It contains some similar concepts to a closed-loop traffic engineering project designed by PAOSS back in 2007, but with one big difference. That 2007 project was based on a single-vendor solution, as opposed to the open, multi-vendor PoC demonstrated here. Both were based on the principle of using assurance monitors to trigger fulfillment responses. For example, ours used SLA threshold breaches on voice switches to trigger automated remedial response through the OSS’s provisioning engine.

For this newer example, ETSI’s blog details, “The PoC story relates to a congestion event caused by a DDoS (Denial of Service) attack that results in a decrease in the voice quality of a network service. The fault is detected by service monitoring within one or more domains and is shared with the end-to-end service orchestrator which correlates the alarms to interpret the events, based on metadata and metrics, and classifies the SLA violations. The end-to-end service orchestrator makes policy-based decisions which trigger commands back to the domain(s) for remediation.”
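The general pattern (monitor, classify against SLA thresholds, make a policy-based decision, send a command back to the domain) can be sketched in a few lines. To be clear, this is not the PoC’s implementation: the metric name, threshold, policy table and function names below are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    domain: str
    metric: str
    value: float


# Hypothetical SLA floor and remediation policy, keyed by metric name
SLA_MINIMUMS = {"voice_mos": 3.5}
POLICIES = {"voice_mos": "reroute_voice_traffic"}


def classify_violations(alarms):
    """Keep only the alarms whose metric has fallen below its SLA minimum."""
    return [
        a for a in alarms
        if a.metric in SLA_MINIMUMS and a.value < SLA_MINIMUMS[a.metric]
    ]


def closed_loop(alarms, send_command):
    """Policy-based decision: send one remediation command per violation."""
    actions = []
    for alarm in classify_violations(alarms):
        command = POLICIES[alarm.metric]
        send_command(alarm.domain, command)   # fulfillment response
        actions.append((alarm.domain, command))
    return actions


sent = []
alarms = [
    Alarm("core", "voice_mos", 2.9),      # degraded voice quality
    Alarm("access", "voice_mos", 4.2),    # within SLA
    Alarm("core", "cpu_load", 0.97),      # no SLA defined for this metric
]
print(closed_loop(alarms, lambda domain, cmd: sent.append((domain, cmd))))
```

Passing `send_command` in as a callable keeps the decision logic separate from whichever provisioning or orchestration interface actually carries out the remediation.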

You’ll notice one of the key call-outs in the diagram above is real-time inventory. That was much harder for us to achieve back in 2007 than it is now with virtualised network and compute layers providing real-time telemetry. We used inventory that was only auto-discovered once daily and had to build in error handling, whilst relying on over-provisioned physical infrastructure.

It’s exciting to see these types of projects being taken forward by ETSI, EnterpriseWeb, et al.

An OSS data creation brain-fade

Many years ago, I made a data migration blunder that slowed a production OSS down to a crawl. Actually, less than a crawl. It almost became unusable.

I was tasked with creating a production database of a carrier’s entire network inventory, including data migration for a bunch of Nortel Passport ATM switches (yes, it was that long ago).

  • There were around 70 of these devices in the network
  • 14 usable slots in each device (ie slots not reserved for processing, resilience, etc)
  • Depending on the card type there were different port densities, but let’s say there were 4 physical ports per slot
  • Up to 2,000 VPIs per port
  • Up to 65,000 VCIs per VPI
  • The customer was running SPVC

To make it easier for the operator to create a new customer service, I thought I should script-create every VPI/VCI on every port on every device. That would allow the operator to just select any available VPI/VCI from within the OSS when provisioning (or later, auto-provisioning) a service.

There was just one problem with this brainwave. For this particular OSS, each VPI/VCI represented a logical port that became an entry alongside physical ports in the OSS’s ports table… You can see what’s about to happen can’t you? If only I could’ve….

My script auto-created nearly 510 billion VCI logical ports; over 525 billion records in the ports table if you also include VPIs and physical ports…. in a production database. And that was just the ATM switches!
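Working the numbers through with the round figures from the list above gives a feel for the scale. This is a back-of-envelope sketch only: the 4-ports-per-slot value is the “let’s say” approximation, and real port densities varied by card type, so these totals will not exactly match the actual record counts.

```python
devices = 70
slots_per_device = 14
ports_per_slot = 4            # approximation; density varied by card type
vpis_per_port = 2_000
vcis_per_vpi = 65_000

physical_ports = devices * slots_per_device * ports_per_slot
vpis = physical_ports * vpis_per_port
vcis = vpis * vcis_per_vpi

print(f"{physical_ports:,} physical ports")
print(f"{vpis:,} VPIs")
print(f"{vcis:,} VCIs")
print(f"{vcis_per_vpi * vpis_per_port:,} VPI/VCI pairs per physical port")
```

With these inputs the VCI count alone comes to roughly 509.6 billion, which dwarfs the few thousand physical ports the table actually needed to hold.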

So instead of making life easier for the operators, it actually brought the OSS’s database to a near stand-still. Brilliant!!

Luckily for me, it was a greenfields OSS build and the production database was still being built up in readiness for operational users to take the reins. I was able to strip all the ports out and try again with a less idiotic data creation plan.

The reality was that there’s no way the customer could ever have used the 2,000 x 65,000 (130 million) VPI/VCI groupings I’d created on every single physical port. Put it this way, there were far fewer than 130 million services across all service types across all carriers across that whole country!

Instead, we just changed the service activation process to manually add new VPI/VCIs into the database on demand as one of the pre-cursor activities when creating each new customer service.
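In our case the on-demand step was manual, but the same idea is easy to picture in code: store only the VPI/VCI pairs that are actually in use, and create the next one when a service needs it. The class below is a hypothetical sketch under simplifying assumptions (it ignores reserved VPI/VCI ranges and is not how that particular OSS worked).

```python
class PortVcAllocator:
    """Creates VPI/VCI records for one physical port only when needed."""

    def __init__(self, max_vpi: int = 2_000, max_vci: int = 65_000):
        self.max_vpi = max_vpi
        self.max_vci = max_vci
        self.in_use = set()          # the only records that ever get stored

    def allocate(self):
        """Return the lowest unused (vpi, vci) pair and record it as in use."""
        for vpi in range(self.max_vpi):
            for vci in range(self.max_vci):
                if (vpi, vci) not in self.in_use:
                    self.in_use.add((vpi, vci))
                    return vpi, vci
        raise RuntimeError("no free VPI/VCI left on this port")

    def release(self, vpi: int, vci: int) -> None:
        """Remove the record when the service is ceased."""
        self.in_use.discard((vpi, vci))


port = PortVcAllocator()
first = port.allocate()
second = port.allocate()
print(first, second, len(port.in_use))
```

The ports table then grows with the number of services rather than with the theoretical address space.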

Ever since that experience, I have stuck to the Minimum Viable Data (MVD) mantra.

Network slicing, another OSS activity

“One business customer, for example, may require ultra-reliable services, whereas other business customers may need ultra-high-bandwidth communication or extremely low latency. The 5G network needs to be designed to be able to offer a different mix of capabilities to meet all these diverse requirements at the same time.
From a functional point of view, the most logical approach is to build a set of dedicated networks each adapted to serve one type of business customer. These dedicated networks would permit the implementation of tailor-made functionality and network operation specific to the needs of each business customer, rather than a one-size-fits-all approach as witnessed in the current and previous mobile generations which would not be economically viable.
A much more efficient approach is to operate multiple dedicated networks on a common platform: this is effectively what “network slicing” allows. Network slicing is the embodiment of the concept of running multiple logical networks as virtually independent business operations on a common physical infrastructure in an efficient and economical way.”
GSMA’s Introduction to Network Slicing.

Engineering a network is an exercise in compromise. There are many different optimisation levers to pull to engineer a set of network characteristics. In the traditional network, it was a case of pulling all the levers to find a middle-ground set of characteristics that supported all of the operator’s service offerings.

QoS striping of traffic allowed for a level of differentiation of traffic handling, but the underlying network was still a balancing act of settings. Network virtualisation offers new opportunities. It allows unique segmentation via virtual networks, where each can be optimised for the specific use-cases of that network slice.

For years, I’ve been posing the concept of telco offerings being like electricity networks – that we don’t need so many service variants. I should note that this analogy is not quite right. We do have a few different types of “electricity” such as highly available (health monitoring), high-bandwidth (content streaming), extremely low latency (rapid reaction scenarios such as real-time sensor networks), etc.

Now what do we need to implement and manage all these network slices?? Oh that’s right, OSS! It’s our OSS that will help to efficiently coordinate all the slicing and dicing that’s coming our way… to optimise all the levers across all the different network slices!

A de facto spatial manager

Many years ago, I was lucky enough to lead a team responsible for designing a complex inside and outside plant network in a massive oil and gas precinct. It had over 120 buildings and more than 30 networked systems.

We were tasked with using CAD (Computer Aided Design) and Office tools to design the comms and security solution for the precinct. And when I say security, not just network security, but building access control, number plate recognition, coast guard and even advanced RADAR amongst other things.

One of the cool aspects of the project was that it was more three-dimensional than a typical telco design. A telco cable network is usually planned on x and y coordinates because the z coordinate is usually on one or two planes (eg all ducts are at say 0.6m below ground level or all catenary wires between poles are at say 5m above ground). However, on this site, cable trays ran at all sorts of levels to run around critical gas processing infrastructure.

We actually proposed to implement a light-weight OSS for management of the network, including outside plant assets, due to the easy maintainability compared with CAD files. The customer’s existing CAD files may have been perfect when initially built / handed-over, but were nearly useless to us because of all the undocumented changes that had happened in the ensuing period. However, the customer was used to CAD files and wanted to stay with CAD files.

This led to another cool aspect of the project – we had to build out de facto OSS data models to capture and maintain the designs.

We modelled:

  • The support plane (trayway, ducts, sub-ducts, trenches, lead-ins, etc)
  • The physical connectivity plane (cables, splices, patch-panels, network termination points, physical ports, devices, etc)
  • The logical connectivity plane (circuits, system connectivity, asset utilisation, available capacity, etc)
  • Interconnection between these planes
  • Life-cycle change management
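To give a flavour of what those planes look like as a data model, here is a heavily simplified sketch. The class and field names are hypothetical (our real models lived in CAD and Office tools, not code), and it only covers a small part of each plane plus the links between them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupportSegment:          # support plane: tray, duct, trench, etc
    name: str
    kind: str
    elevation_m: float         # height relative to ground level (the 3D part)


@dataclass
class Cable:                   # physical connectivity plane
    name: str
    fibre_count: int
    route: List[SupportSegment]    # interconnection to the support plane


@dataclass
class Circuit:                 # logical connectivity plane
    name: str
    cable: Cable               # interconnection to the physical plane
    fibre: int


def spare_fibres(cable: Cable, circuits: List[Circuit]) -> List[int]:
    """Available capacity: fibres on this cable not used by any circuit."""
    used = {c.fibre for c in circuits if c.cable is cable}
    return [f for f in range(1, cable.fibre_count + 1) if f not in used]


tray = SupportSegment("T-01", "tray", elevation_m=4.5)
cable = Cable("C-01", fibre_count=4, route=[tray])
circuits = [Circuit("CCTV-1", cable, fibre=1), Circuit("ACS-1", cable, fibre=3)]
print(spare_fibres(cable, circuits))
```

Even at this scale, questions like “what is affected if this tray is removed?” become simple traversals across the planes, which is much harder to answer from drawings alone.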

This definitely gave a better appreciation for the type of rules, variants and required data sets that reside under the hood of a typical OSS.

Have you ever had a non-OSS project that gave you a better appreciation / understanding of OSS?

I’m also curious. Have any of you designed your physical network plane in three dimensions? With a custom or out-of-the-box tool?

Unexpected OSS indicators

Yesterday’s post talked about using customer contacts as a real-time proxy metric for friction in the business, which could also be a directional indicator for customer experience.

That got me wondering what other proxy metrics might be used to provide predictive indicators of what’s happening in your network, OSS and/or BSS. Apparently, “Colt aims to enhance its service assurance capabilities by taking non-traditional data (signal strength, power, temperature, etc.) from network elements (cards, links, etc.) to predict potential faults,” according to James Crawshaw here on LightReading.

What about environmental metrics like humidity, temperature, movement, power stability/disturbance?
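As a very basic illustration of turning such a metric into an indicator, the sketch below flags readings that sit well outside a baseline using a simple z-score. This is an assumed, generic technique for the sake of example, not a description of what Colt does; the window size, threshold and temperature values are hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(readings, baseline_size=8, threshold=3.0):
    """Return indexes of readings far from the baseline (by z-score).

    The first baseline_size readings define "normal"; later readings more
    than threshold standard deviations from the baseline mean are flagged.
    """
    baseline = readings[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return []                      # flat baseline, z-score undefined
    return [
        i for i, value in enumerate(readings[baseline_size:], start=baseline_size)
        if abs(value - mu) / sigma > threshold
    ]


card_temperature_c = [40.0, 41.0, 39.0, 40.0, 40.0, 41.0, 39.0, 40.0, 40.5, 55.0]
print(flag_anomalies(card_temperature_c))
```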

I’d love to hear about what proxies you use or what unexpected metrics you’ve found to have shone the spotlight on friction in your organisation.

Shooting the OSS messenger

NPS, or Net Promoter Score, has become commonly used in the telecoms industry in recent years. In effect, it is a metric that measures friction in the business. If NPS is high, the business runs more smoothly. Customers are happy with the service and want to buy more of it. They’re happy with the service so they don’t need to contact the business. If NPS is low, it’s harder to make sales and there’s the additional cost of time dealing with customer complaints, etc (until the customer goes away of course).

NPS can be easy to measure via survey, but a little more challenging as a near-real-time metric. What if we used customer contacts (via all channels such as phone, IVR, email, website, live-chat, etc) as a measure of friction? But more importantly, how does any of this relate to OSS / BSS? We’ll get to that shortly (I hope).

BSS (billing, customer relationship management, etc) and OSS (service health, network performance, etc) tend to be the final touchpoints of a workflow before reaching a customer. When the millions of workflows through a carrier are completing without customer contact, then friction is low. When there are problems, calls go up and friction / inefficiency is also going up. When there are problems, the people (or systems) dealing with the calls (eg contact centre operators) tend to start with OSS / BSS tools and then work their way back up the funnel to identify the cause of friction and attempt to resolve it.
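One way to express that as a near-real-time number is customer contacts per 1,000 completed workflows in each time window. The function name, the normalisation and the hourly figures below are hypothetical, offered as a sketch of the idea rather than an established metric.

```python
def friction_rate(contacts: int, completed_workflows: int) -> float:
    """Customer contacts per 1,000 completed workflows in a time window."""
    if completed_workflows == 0:
        return 0.0
    return 1000 * contacts / completed_workflows


# Hypothetical hourly counts: (contacts across all channels, workflows completed)
hourly = {"09:00": (120, 40_000), "10:00": (480, 41_000)}
rates = {hour: friction_rate(c, w) for hour, (c, w) in hourly.items()}
for hour, rate in rates.items():
    print(f"{hour}: {rate:.1f} contacts per 1,000 workflows")
```

A jump between windows doesn’t say where the problem is, only that friction has risen and that it’s time to start tracing back up the funnel.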

The problem is that the OSS / BSS tools are often seen as the culprit because that’s where the issue first becomes apparent. It’s easier to log an issue against the OSS than to keep tracking back to the real source of the problem. Many times, it’s a case of shooting the messenger. Not only that, but if we’re not actually identifying the source of the problem then it becomes systemic (ie the poor customer experience perpetuates).

Maybe there’s a case for us to get better at tracking the friction caused further upstream of our OSS / BSS and to give more granular investigative tools to the call takers. Even if we do, our OSS / BSS are still the ones delivering the message.

The OSS Matrix – the blue or the red pill?

OSS tend to be very good at presenting a current moment in time – the current configuration of the network, the health of the network, the activities underway.

Some (but not all) tend to struggle to cope with other moments in time – past and future.

Most have tools that project into the future for the purpose of capacity planning, such as link saturation estimation (based on projecting forward from historical trend-lines). Predictive analytics is a current buzz-word as research attempts to predict future events and mitigate for them now.
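As a rough illustration of projecting forward from a historical trend-line, the sketch below fits a straight line to daily utilisation samples and estimates when it crosses a saturation threshold. The function names, the 80 percent threshold and the assumption of a linear trend are all illustrative choices, not a description of any particular OSS product.

```python
def fit_trend(samples):
    """Least-squares straight line through (day_index, sample) points."""
    n = len(samples)  # needs at least two samples
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_saturation(utilisation_pct, threshold_pct=80.0):
    """Days from the last sample until the trend reaches threshold_pct.

    Returns None when the trend is flat or falling (no projected saturation).
    The 80% default is an assumed planning threshold.
    """
    slope, intercept = fit_trend(utilisation_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    last_day = len(utilisation_pct) - 1
    return max(0.0, crossing_day - last_day)


history = [40.0, 42.0, 44.0, 46.0, 48.0]  # made-up daily link utilisation (%)
print(days_until_saturation(history))  # 16.0
```

Real traffic is rarely this linear, so a production tool would more likely account for seasonality and busy-hour peaks. The straight line is only the simplest form of the idea.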

Most also have the ability to look into the past – to look at historical logs to give an indication of what happened previously. However, historical logs can be painful and tend towards forensic analysis. We can generally see who (or what) performed an action at a precise timestamp, but it’s not so easy to correlate the surrounding context in which that action occurred. They rarely present a fully-stitched view in the OSS GUI that shows the state of everything else around it at that snapshot in time past. At least, not to the same extent that the OSS GUI can stitch and present current state together.

But the scenario that I find most interesting is for the purpose of network build / maintenance planning. Sometimes these changes occur as isolated events, but are more commonly run as projects, often with phases or milestone states. For network designers, it’s important to differentiate between assets (eg cables, trenches, joints, equipment, ports, etc) that are already in production versus assets that are proposed for installation in the future.

And naturally those states cross over at cut-in points. The proposed new branch of the network needs to connect to the existing network at some time in the future. Designers need to see available capacity now (eg spare ports), but be able to predict with confidence that capacity will still be available for them in the future. That’s where the “reserved” status comes into play, which tends to work for physical assets (eg physical ports) but can be more challenging for logical concepts like link utilisation.
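To make the “reserved” status concrete, here is a minimal sketch of a port lifecycle in which a future project can hold a spare port until cut-in. The class names, states and project identifiers are hypothetical. It also only covers the physical-port case, since logical capacity such as link utilisation doesn’t reduce to a simple per-asset flag.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PortState(Enum):
    SPARE = "spare"
    RESERVED = "reserved"      # held for a future project, not yet carrying traffic
    IN_SERVICE = "in_service"


@dataclass
class Port:
    name: str
    state: PortState = PortState.SPARE
    reserved_by: Optional[str] = None  # hypothetical project identifier


@dataclass
class Device:
    name: str
    ports: List[Port] = field(default_factory=list)

    def spare_ports(self) -> List[Port]:
        """Ports a designer can still plan against today."""
        return [p for p in self.ports if p.state is PortState.SPARE]

    def reserve(self, project_id: str) -> Port:
        """Hold the first spare port for a future project."""
        spare = self.spare_ports()
        if not spare:
            raise LookupError(f"no spare ports left on {self.name}")
        port = spare[0]
        port.state = PortState.RESERVED
        port.reserved_by = project_id
        return port

    def cut_in(self, project_id: str) -> None:
        """At cut-in, move the project's reserved ports into production."""
        for port in self.ports:
            if port.state is PortState.RESERVED and port.reserved_by == project_id:
                port.state = PortState.IN_SERVICE


olt = Device("OLT-1", [Port("1/1"), Port("1/2")])
olt.reserve("PRJ-A")
print([p.name for p in olt.spare_ports()])  # ['1/2']
```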

In large organisations, it can be even more challenging because there’s not just one augmentation project underway, but many. In some cases, there can be dependencies where one project relies on capacity that is being stood up by other future projects.

Not all of these projects / plans will make it into production (eg funding is cut or a more optimal design option is chosen), so there is also the challenge of deprecating planned projects. Capability is required to find whether any other future projects are dependent on this deprecated future project.
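One way to picture that capability is as a walk over a project dependency graph. The sketch below assumes each planned project records which other projects’ capacity it relies on. The project names and the `depends_on` structure are invented for illustration.

```python
from collections import deque


def dependent_projects(depends_on, cancelled):
    """Return every project that directly or indirectly relies on `cancelled`.

    depends_on maps a project name to the set of projects whose capacity
    it relies on (an assumed, simplified data model).
    """
    # Invert the mapping: prerequisite -> projects that rely on it.
    dependents = {}
    for project, prerequisites in depends_on.items():
        for prerequisite in prerequisites:
            dependents.setdefault(prerequisite, set()).add(project)

    affected = set()
    queue = deque([cancelled])
    while queue:  # breadth-first walk down the dependency chain
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    affected.discard(cancelled)  # guard against circular dependencies
    return affected


plans = {
    "ring-upgrade": set(),
    "new-exchange": {"ring-upgrade"},
    "estate-fttp": {"new-exchange"},
    "dc-link": set(),
}
print(sorted(dependent_projects(plans, "ring-upgrade")))
# ['estate-fttp', 'new-exchange']
```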

It can get incredibly challenging to develop this time/space matrix in OSS. If you’re a developer of OSS, the question becomes whether you want to take the blue or red pill.

Front-loading with OSS auto-discovery

Yesterday’s post discussed the merits of front-loading effort on knowledge transfer to new starters and on automated testing, whilst acknowledging the challenges that often prevent that from happening.

Today we look at the front-loading benefits of building OSS / network auto-discovery tools.

We all know that OSS are only as good as the data we seed them with. As the old saying goes, garbage in, garbage out.

Assurance / network-health data is generally collected directly from the network in near real time, typically using SNMP traps, syslog events and similar. The network is generally the data master for this assurance-style data, so it makes sense to pull data from the network wherever possible (ie bottom-up data flows).

Fulfilment data, in the form of customer orders, network designs, etc, is often captured in external systems first (ie as master) and pushed into the network as configurations (ie top-down data flows).
Bottom-up and top-down OSS data flows

These two flows meet in the middle, as they both tend to rely on network inventory and/or resources. Bottom-up, network faults (eg fibre cuts, port failure, device failure, etc) need to be tied to inventory / resource / topology information, which is important for fault identification (Root Cause Analysis [RCA]). Similarly, top-down, customer services / circuits / contracts tend to consume inventory, capacity and/or other resources.

Looking end-to-end and correlating network health (assurance) to customer service health (fulfilment) (eg as Service Level Agreement [SLA] analysis, Service Impact Analysis [SIA]) tends to be possible only because inventory / resource data sets act as the linking keys between the two.
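A minimal sketch of that meet-in-the-middle correlation is shown below, assuming both alarms and services reference resources by the same inventory identifier. All field names, identifiers and records are made up for illustration.

```python
from collections import defaultdict

# Bottom-up: alarms raised against network resources (assurance).
alarms = [
    {"alarm_id": "A1", "resource": "ROUTER-01/port-3", "severity": "critical"},
    {"alarm_id": "A2", "resource": "ROUTER-02/port-1", "severity": "minor"},
]

# Top-down: customer services and the resources they consume (fulfilment).
services = [
    {"service_id": "SVC-100", "resources": ["ROUTER-01/port-3", "ROUTER-05/port-2"]},
    {"service_id": "SVC-200", "resources": ["ROUTER-01/port-3"]},
    {"service_id": "SVC-300", "resources": ["ROUTER-09/port-7"]},
]


def impacted_services(alarms, services):
    """Map each alarm to the services that consume the alarmed resource."""
    services_by_resource = defaultdict(list)
    for service in services:
        for resource in service["resources"]:
            services_by_resource[resource].append(service["service_id"])
    return {
        alarm["alarm_id"]: sorted(services_by_resource.get(alarm["resource"], []))
        for alarm in alarms
    }


print(impacted_services(alarms, services))
# {'A1': ['SVC-100', 'SVC-200'], 'A2': []}
```

The join only works if both sides name the resource identically, which is why consistent naming in the inventory matters so much.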

Seeding an OSS with inventory / resource data can be done via three methods, or a combination of them:

  1. Data migration (eg script-loading from external sources such as spreadsheet or CSV files)
  2. Manual data creation
  3. Auto-discovery (ie collection of data directly from the network)

Options 1 and 2 are probably the more traditional methods of initially seeding OSS databases, mainly because they tend to be faster to demonstrate progress.

Option 3 is the front-loading option that can be challenging in the short-to-medium term but will hopefully prove beneficial in the longer term (just like knowledge transfer and automated testing).

It might seem easy to just suck data directly out of the network, but the devil is definitely in the detail, such as:

  • Choosing optimal timings to poll the network without saturating it (if notifications aren’t being pushed by the network), not to mention session handling of these long-running data transfers
  • Building the mediation layer to perform protocol conversion and data mappings
  • Field translation / mapping to common naming standards. This can be much more difficult than it sounds and is key to allow the assurance and fulfilment meet-in-the-middle correlations to occur (as described above)
  • Reconciliation between the data being presented by the network and what’s already in the OSS database. In theory, the data presented by the network should always be right, but there are scenarios where it’s not (eg flapping ports giving the appearance of assets being present / absent from network audit data, assets in test / maintenance mode that aren’t intended to be accepted into the OSS inventory pool, lifecycle transitions from planned to built, etc)
  • Discrepancy and exception handling rules
  • All of the above make it challenging for “siloed” data sets, but perhaps even more challenging is the discovery and auto-stitching of cross-domain data sets (eg cross-domain circuits / services / resource chains)
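To give a feel for the reconciliation and exception-handling items above, here is a simplified sketch that compares network audit snapshots against the OSS inventory. The data shapes, the state names and the rule of only flagging an asset as missing once it is absent from every recent audit (to dampen flapping) are assumptions for illustration, not a standard algorithm.

```python
def reconcile(audits, inventory, ignore_states=frozenset({"test", "maintenance"})):
    """Compare network audit snapshots with OSS inventory records.

    audits:    non-empty list of snapshots (oldest first), each a dict of
               asset_id -> attributes as reported by the network
    inventory: dict of asset_id -> attributes currently held in the OSS
    """
    latest = audits[-1]
    known = set(inventory)
    seen_in_any_audit = set().union(*audits)  # union of asset ids across snapshots

    # Assets in test / maintenance mode are not offered to the inventory pool.
    candidates = {
        asset for asset, attrs in latest.items()
        if attrs.get("state") not in ignore_states
    }
    return {
        "new_in_network": sorted(candidates - known),
        # Absent from every recent audit, so unlikely to be just a flapping port.
        "missing_from_network": sorted(known - seen_in_any_audit),
        "attribute_mismatch": sorted(
            asset for asset in candidates & known
            if latest[asset] != inventory[asset]
        ),
    }


inventory = {
    "SW1/1": {"state": "active", "speed": "1G"},
    "SW1/2": {"state": "active", "speed": "1G"},
    "SW1/3": {"state": "active", "speed": "1G"},
}
audits = [
    {"SW1/1": {"state": "active", "speed": "10G"},
     "SW1/2": {"state": "active", "speed": "1G"}},
    {"SW1/1": {"state": "active", "speed": "10G"},
     "SW1/4": {"state": "active", "speed": "1G"},
     "SW1/5": {"state": "test", "speed": "1G"}},
]
print(reconcile(audits, inventory))
```

In this made-up data, SW1/2 drops out of the latest audit but appeared in the previous one, so it is treated as flapping rather than missing. SW1/5 is ignored because it is in test mode. What to do with each discrepancy bucket (auto-accept, raise an exception for manual review, etc) is where the real rules work lies.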

The vexing question arises – do you front-load and seed via auto-discovery or perform a creation / migration that requires more manual intervention? In some cases, it could even be a combination of both as some domains are easier to auto-discover than others.

From PoC to OSS sandpit

You all know I’m a fan of training operators in OSS sandpits (and as apprenticeships during the build phase) rather than a week or two of classroom training at the end of a project.

To reduce the re-work in building a sandpit environment, which will probably be a dev/test environment rather than a production environment, I like to go all the way back to the vendor selection process.

Running a Proof of Concept (PoC) is a key element of vendor selection in my opinion. The PoC should only include a small short-list of pre-selected solutions so as not to waste the time of the operator or vendor / integrator. But once short-listed, the PoC should be a cut-down reflection of the customer’s context. Where feasible, it should connect to some real devices / apps (maybe lab devices / apps, possibly via a common/simple interface like SNMP). This takes some time on both sides to set up, but it shows how easily (or not) the solution can integrate with the customer’s active network, BSS, etc. It should be specifically set up to show the device types, alarm types, naming conventions, workflows, etc that fit into the customer’s specific context. That allows the customer to understand the new OSS in terms they’re familiar with.

And since the effort has been made to set up the PoC, doesn’t it make sense to make further use of it and not just throw it away? If the winning bidder then leaves the PoC environment in the hands of the customer, it becomes the sandpit to play in. The big benefit for the winning bidder is that hopefully the customer will have fewer “what if?” questions that distract the project team during the implementation phase. Answers can be demonstrated, even if only partially, using the sandpit environment rather than given as empty words.