You’ve no doubt heard of the NOC (Network Operations Centre) and the SOC (Security Operations Centre, or perhaps alternatively, Service Operations Centre). These are the people, processes, infrastructure and tools that allow a network operator to manage the health and security posture of their networks. The NOC and SOC are vitally important to keeping a modern network running. Unfortunately though, we’re missing a DOC, and I’m not talking about a word processing file here.
So what exactly is a DOC? Well, we’ll get to that shortly.
But first, let’s consider how OSS/BSS and their data contribute to the NOC and SOC. Our tools collect all the events and telemetry data, then aggregate them for use within the NOC and SOC tools. Our tools then help NOC and SOC teams process the data, triage the problems, and coordinate and manage the remediation efforts.
Speaking of data processing, Data Integrity / Data Quality (DI / DQ) is a significant challenge, and cost, for network operators globally. Most have to invest in systemic programs of work to maintain or improve data quality and ensure the data can be relied upon by the many people who interact with it. Operators know that if their OSS/BSS data goes into a death spiral, then the OSS/BSS tools become useless, no matter how good they are.
The problem with the data-fix programs is that operators tend to prefer algorithmic fixes (the little loop, rather than big loop) to maintain data quality. Algorithmic fixes, designed and implemented by data scientists, tend to be cheaper and easier. However, this has two ramifications. Firstly, little loop fixes tend to reach an asymptote (diminishing rate of return) long before reaching 100% data accuracy. Secondly, algorithmic fixes can only be cost-justified if they’re repairing batches of data.
The reality is that some data, particularly data that can’t be reconciled via an API request, can only be fixed via manual intervention. For example, passive infrastructure like conduits can’t provide status or configuration updates. Similarly, some data faults are single-instance only and need to be fixed on a data-point by data-point basis. Unfortunately, most carrier processes don’t have a mechanism for immediate data fixes, such as when a field tech is still on site and in a position to trace out the real situation. That’s where the DOC comes in. As you’ve probably worked out, the DOC I’m proposing is a Data Operations Centre.
We don’t have any pre-built data-fix tools like we do for network-fix or security health management today (only the analytics tools that are built for the ad-hoc needs of each customer). Unlike network or security faults, individual users or field workers (or perhaps even customers) can’t log a data fault, or be notified when it’s repaired.
The proposed DOC would be fitted out with the tools and processes required to log a data fault, apply triage to identify the problem and its priority, determine a set of remedial activities and then ensure every prioritised fault is repaired. Our OSS/BSS tools potentially have a big part to play in supporting a DOC, but we’ll get to that shortly. First we’ll describe how OSS/BSS data can be better utilised.
The connected nature of our networks means that faults in the network often ripple out to other parts of the network. Our OSS/BSS store the proximity effects – by time (log files), by geo-position (location), by topology (connections to other devices) and by hierarchy (relationships such as a card belonging to a device) – that allow cascading faults to be traced back to a root cause.
Some network health issues can be immediate (eg a card failure), whilst some can be more pernicious (eg the slow deterioration of a signal as a connector corrodes over time). Just as a network fault can propagate, so too can a data fault. Data faults tend to be pernicious though, and cascading data faults can be harder to pinpoint. They therefore need to be fixed before they cause ripple-out impacts.
Just like with network adjacencies, data proximity factors are a fundamental element needed to generate a more repeatable approach for data fault management.
The data proximity factors are shown in the diagram above:
- Nodal / Hierarchical Proximity (list #1 above), which shows how data points can have parent-child relationships (eg a port is a child of a card, which is a child of a device, which is a child of a rack, and so on)
- Connected Proximity (list #2), where data points can be cross-linked with each other (eg a port on an antenna is connected to a port on a radio unit)
- Associated Proximity (not shown on diagram), where different data points can be associated with each other (eg a customer service can relate to a circuit, an IP address can relate to a port and/or subnet, a device can relate to a project, and many more)
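To make the three proximity factors more concrete, here is a minimal sketch of how they might be held in memory. The `ProximityGraph` class, its method names and the sample identifiers are all hypothetical, purely for illustration. A real OSS/BSS inventory would hold these relationships in its own data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProximityGraph:
    """Hypothetical store for the three proximity factors."""
    # Nodal / hierarchical proximity: child id -> parent id
    parent: dict = field(default_factory=dict)
    # Connected proximity: cross-links between data points
    connected: defaultdict = field(default_factory=lambda: defaultdict(set))
    # Associated proximity: looser relationships (service, IP address, project)
    associated: defaultdict = field(default_factory=lambda: defaultdict(set))

    def add_child(self, child, parent):
        self.parent[child] = parent

    def connect(self, a, b):
        self.connected[a].add(b)
        self.connected[b].add(a)

    def associate(self, a, b):
        self.associated[a].add(b)
        self.associated[b].add(a)

    def neighbours(self, node):
        """Return every data point one hop away by any proximity factor."""
        result = set(self.connected.get(node, set()))
        result |= self.associated.get(node, set())
        if node in self.parent:
            result.add(self.parent[node])
        result |= {c for c, p in self.parent.items() if p == node}
        return result


graph = ProximityGraph()
graph.add_child("port-1", "card-1")       # a port is a child of a card
graph.add_child("card-1", "device-1")     # which is a child of a device
graph.connect("port-1", "radio-port-7")   # antenna port linked to radio port
graph.associate("port-1", "10.0.0.5")     # an IP address relates to a port

print(sorted(graph.neighbours("port-1")))
```

If a data fault is logged against `port-1`, the `neighbours` call gives a first list of data points that may have been affected by the same problem and are worth checking.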
These proximity factors can be leveraged in the following ways to support a DOC to log, categorise, visualise, then repair data faults:
- Assign Confidence Levels* to each data point, which can be created:
    - Manually – where OSS/BSS users, field workers, customers, etc can provide a confidence rating against any given data point, particularly when they experience problematic data
    - Algorithmically – where algorithms can analyse data and identify faults (eg perform a trace and identify that only the A-end of a circuit exists, but not the Z-end)
    - By Lineage – where certain data sources are deemed less reliable than others
    - By type / category / class – where data, possibly gathered from an external source, has some data classes that are given different confidence levels (eg circuit names exist, but there’s no port-level connectivity recorded for each circuit)
- Having systematic confidence level rankings allows the following to be created:
    - Heat-maps, which allow clusters of proximal data faults to be identified for repair
    - Fitness Functions or Scorecards, which quickly identify the level of data integrity and whether it is improving / deteriorating
    - Data Fault creation rules, which allow a data fault to be logged for repair if certain conditions are met (eg confidence is zero, implying a fault that needs remediation)
- Faults can then be raised, either against individual data points, or jointly for systematic management through to repair / closure
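The confidence-level ideas above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `DataPoint` class, the 0.0 to 1.0 rating scale, the choice to take the lowest rating as the overall confidence, and the use of the hierarchical parent as the heat-map grouping key.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DataPoint:
    point_id: str
    parent_id: str  # hierarchical parent, used to cluster faults
    ratings: list = field(default_factory=list)  # (source, score 0.0-1.0)

    def rate(self, source, score):
        """Record a rating from a person, an algorithm or a lineage rule."""
        self.ratings.append((source, score))

    def confidence(self):
        """Assumed rule: the lowest rating wins; unrated points are unknown."""
        if not self.ratings:
            return None
        return min(score for _, score in self.ratings)


def scorecard(points):
    """Mean confidence across rated points, as a simple fitness function."""
    rated = [p.confidence() for p in points if p.confidence() is not None]
    return mean(rated) if rated else None


def raise_faults(points, threshold=0.0):
    """Fault creation rule: log a fault when confidence is at or below threshold."""
    return [p.point_id for p in points
            if p.confidence() is not None and p.confidence() <= threshold]


def heat_map(points, fault_ids):
    """Count faults per parent so clusters of proximal faults stand out."""
    return Counter(p.parent_id for p in points if p.point_id in fault_ids)


points = [
    DataPoint("port-1", "card-1"),
    DataPoint("port-2", "card-1"),
    DataPoint("port-3", "card-2"),
]
points[0].rate("field-tech", 0.0)        # manual rating
points[1].rate("trace-algorithm", 0.0)   # algorithmic rating
points[1].rate("lineage", 0.5)           # rating by data source
points[2].rate("lineage", 1.0)

faults = raise_faults(points)
print(faults)
print(heat_map(points, faults))
print(scorecard(points))
```

Tracking the `scorecard` value over time would show whether data integrity is improving or deteriorating, while the `heat_map` counts point to the parts of the hierarchy where faults are clustering.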
* Note: I’ve only seen one OSS/BSS tool where data confidence functionality was built in. You can read more about it in the link.
Interestingly, the success of the NOC and SOC is dependent upon the quality of the data, so you could argue that a DOC should actually take precedence.
The key call-out in this article comes from drawing a distinction between a DOC and the way data is managed in most organisations, as follows:
- Data quality issues should be treated as data faults
- Data faults need to be treated individually, one unique data point at a time, not just as a collective to apply an algorithm to (although, as with network faults, we may choose to aggregate unique data faults and treat them as a collective)
- Each data fault needs to be managed systematically (eg itemised, acknowledged, actioned, possibly assigned remediation workflows, repaired and closed)
- There is an urgency around the fix of each data fault, just like network faults. People who experience a data fault may expect time-based data-fix SLAs to apply, firstly so they can perform their actions with greater confidence / reliability, and secondly so the data faults don’t ripple out and cause additional problems
- There is a known contact point (eg phone number, drop-box, etc) for the DOC, so anyone who experiences a data issue knows how to log a fault. By comparison, in many organisations, if a field worker notes a discrepancy between their design pack and the real situation in the field, they just work around the problem and leave without fixing the data fault/s. They invariably have no mechanism for providing feedback. The data problem continues to exist and will cause problems for the next field tech who comes to the same site. Note that there may also be algorithms / rules generating faults, not just humans
- There are notifications upon closure and/or fix of a data fault (if needed)
- We provide the DOC with fault management tools, like the ITSM tools we use to monitor and manage IT or network faults, but for managing data faults. It’s possible that we could even use our standard fault management tools, but with customisation to handle data type faults
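As a rough illustration of systematic data fault management, the lifecycle described above could be modelled as a small state machine. The state names, the strict one-way transitions, the `DataFault` class and the sample identifiers are all assumptions for this sketch. An existing ITSM tool would bring its own workflow model.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    LOGGED = "logged"
    ACKNOWLEDGED = "acknowledged"
    ACTIONED = "actioned"
    REPAIRED = "repaired"
    CLOSED = "closed"


# Assumed workflow: each state can only move to the next one in sequence
ALLOWED = {
    State.LOGGED: {State.ACKNOWLEDGED},
    State.ACKNOWLEDGED: {State.ACTIONED},
    State.ACTIONED: {State.REPAIRED},
    State.REPAIRED: {State.CLOSED},
    State.CLOSED: set(),
}


@dataclass
class DataFault:
    fault_id: str
    data_point: str
    reporter: str  # a person, or the name of a rule that raised the fault
    state: State = State.LOGGED
    notifications: list = field(default_factory=list)

    def advance(self, new_state):
        """Move the fault to its next state, notifying the reporter on fix / closure."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        if new_state in (State.REPAIRED, State.CLOSED):
            self.notifications.append(
                f"{self.reporter}: fault {self.fault_id} is {new_state.value}"
            )


# A field tech logs a discrepancy between the design pack and the site
fault = DataFault("DF-001", "conduit-17", "field-tech-42")
for step in (State.ACKNOWLEDGED, State.ACTIONED, State.REPAIRED, State.CLOSED):
    fault.advance(step)

print(fault.state.value)
print(fault.notifications)
```

Time-based SLAs could be layered on top by recording a timestamp at each transition, which is left out here to keep the sketch short.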