Building the Ultimate Network and Service Assurance Framework

Q. What do you get when you deliver customer services over a network that combines GPON, DSL, LTE and a stack of other network technologies?

A. A service assurance nightmare!!

Every operator claims high availability. Five-nines (99.999% uptime) just never seems to be enough – but in reality, it’s dubious whether it’s even achievable.

This is where your network and service assurance framework can prove it – although it might be aided by the right exclusions, attribution logic and customer-facing transparency approach.

Here’s how the best telcos are tackling (and measuring) the immensely challenging and diverse world of network and service availability.

.

Step 1: Define the Boundaries of Network and Service Assurance

Before designing KPIs (Key Performance Indicators) or service-level metrics, it’s crucial to define what’s actually being assured: the physical network, or the services that run across it. Each scope has its own set of assumptions, stakeholders and impact zones – and clarity here affects everything downstream.

1.1 Network Assurance vs Service Assurance

  • Network assurance (or what I personally refer to as “Nodal” Assurance – as in, each of the nodes/components in the network): focuses on physical infrastructure uptime. This could include device availability, port availability, link-level performance, device health and topology integrity. It answers the question: Is the network technically up?

  • Service assurance (or what I refer to as E2E Assurance, as in, the end-to-end experience of the customer / service): focuses on the customer experience. This includes real or perceived outages, degradation events, combinatory effects, or access failures even when the underlying network appears healthy. It answers the question: Is the customer able to use the service they’re paying for?

The two are not interchangeable:

  • A GPON OLT might be “up” (network assurance) while the downstream ONTs from it are powered down (service disruption). [Note: we will discuss scenarios where powered-down ONTs might be discounted from calculations later]

  • A core router could fail over cleanly, showing full redundancy success in network terms – yet some customers might still experience dropped sessions or degraded performance (a service-level impact)

.

1.2 Ensuring Assurance: From Customer Promise to Network & System Architecture

When it comes to assurance and what we need to measure / manage, it all starts with the customer (as almost all OSS/BSS design should). More specifically, it starts with what the network operator is promising to deliver to the customer:

  • Customer Promise – The operator’s public or contractual commitment (e.g. “99.99% uptime for business broadband”)

  • Product Promise – How this commitment is translated into SLAs (Service Level Agreements), QoS (Quality of Service) guarantees and availability metrics

  • Service Design – The architecture and procedures required to fulfil the product promise, including MTBF targets, response times and redundancy levels

Each level must inform the layers beneath it. A vague or undefined customer promise leads to disconnected KPIs, a lack of transparency around customer expectations, and operational teams that don’t know exactly what their obligations are.

.

1.3 Internal vs External Metrics

Operators must differentiate between:

  • Internal-facing metrics used for planning, engineering and NOC operations (e.g. port-level uptime, network MTBF). These are generally the raw, unmanipulated numbers – the brutal reality. These are generally network assurance or “nodal” measures described earlier, because it’s typically much easier to collect metrics from each device. Service assurance or “E2E” measures are still used, but often less commonly because these are often derived values and/or much harder to calculate

  • Customer-facing metrics used for SLAs, transparency and regulatory reporting (e.g. Customer Impact Hours, service availability %, ticket-based resolution times). These often incorporate exclusions (such as force majeure events), which is acceptable, as long as there is transparency for customers. These are typically more E2E or Service Assurance in nature

These exclusions, or fault attribution models (more on these later), are often drawn from regulatory rules or offer documents and may incorporate rules such as:

  • Intentional Outages – Uptime excludes planned outages such as scheduled maintenance

  • Events Outside the Operator’s Control – Excludes uncontrollable events such as force majeure, power outages not caused by the operator (eg customer-induced outages such as the customer powering down the ONT at their premises), third-party facility failures, etc

  • Different types of networks (eg Satellite and non-Satellite networks) often have distinct availability calculations

  • Downtime is segmented into intentional vs uncontrollable vs actual impact – giving a granular model for internal ops teams and standardised logic for customer-facing reports

However, operators still record a total of all downtime without these exclusions as it helps to identify where changes / improvements might be needed.

.

1.4 Simple Availability Equations – Availability and CIH (Customer Impacted Hours)

1.4.1 Availability

AVAILABILITY = (TOTAL_TIME – DOWNTIME) / TOTAL_TIME

For example, there are 8,760 hours in a given year, so if a device is down for 5 hours in that year then the availability is:

(8760 – 5)/8760 = 99.943% availability.

As the table below suggests, five-nines is only around 5 minutes of downtime in an entire year.

  Availability             Days / Year   Hours / Year   Mins / Year
  Five Nines    99.999%    0.00365       0.0876         5.26
  Four Nines    99.990%    0.0365        0.876          52.56
  Three Nines   99.900%    0.365         8.76           525.6
  Two Nines     99.000%    3.65          87.6           5,256
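
As a quick sanity check, the downtime budget for any number of nines can be derived directly from the availability equation and table above. Here’s a minimal Python sketch (illustrative only, using the 8,760-hour year from the text):

```python
# Downtime budget per year for a given availability target (illustrative sketch).
HOURS_PER_YEAR = 8760

def downtime_budget(availability: float) -> dict:
    """Return the allowed downtime per year for an availability target (0-1)."""
    hours = (1 - availability) * HOURS_PER_YEAR
    return {"days": hours / 24, "hours": hours, "minutes": hours * 60}

for target in (0.99999, 0.9999, 0.999, 0.99):
    print(f"{target:.3%} -> {downtime_budget(target)}")
# Five-nines allows roughly 5.26 minutes of downtime in an entire year.
```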

.

1.4.2 Customer Impacted Hours (CIH)

But availability isn’t the only metric that matters. It’s not just about “how long the network was down” – it’s about who was impacted and for how long. That’s where Customer Impacted Hours (CIH) provides clarity:

  • CIH = CUSTOMERS × IMPACT_HOURS
    number of customers × duration of impact (note that if there’s more than one outage then this formula relates to the sum of customers × impacts)

  • Average Customer Availability = 1 – (CIH / (CUSTOMERS × 8,760))

  • This creates a bridge between network uptime and customer service reliability

For example:
If 100,000 customers were impacted for a total of 8,760 hours (an average of 5.26 minutes per customer per year), the customer-level availability is 99.999% (five-nines), but the granularity of CIH allows targeted investment in the most frequently impacted service areas.
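
A minimal sketch of the CIH arithmetic and the worked example above (the outage records and field names are hypothetical):

```python
# Customer Impacted Hours (CIH) and average customer availability (illustrative sketch).
HOURS_PER_YEAR = 8760

# Hypothetical outage records: customers affected and outage duration in hours
outages = [
    {"customers": 60_000, "impact_hours": 0.100},  # core incident
    {"customers": 40_000, "impact_hours": 0.069},  # access incident
]
customer_base = 100_000

cih = sum(o["customers"] * o["impact_hours"] for o in outages)
avg_customer_availability = 1 - cih / (customer_base * HOURS_PER_YEAR)

print(f"CIH = {cih:,.0f} customer-hours")                                  # 8,760
print(f"Average customer availability = {avg_customer_availability:.3%}")  # 99.999%
```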

.

1.5 Defining Service Assurance Domains

To make service assurance measurable, it should be segmented by service type, domain or delivery model:

  • Access technologies: DSL, GPON, HFC, LTE, Fixed Wireless, etc.

  • Service tiers: best effort vs premium SLA-backed vs managed endpoint

  • Customer cohorts: residential vs enterprise vs wholesale

  • Service chains: single vs multi-vendor, single- vs multi-domain

Each domain may require unique KPI targets and fault attribution rules. For example, the NBN separates its GEO satellite-based services from fixed access in its availability calculations due to the environmental and operational complexity.

.

Step 2: Calculate Availability with MTBFs, Redundancy and Service Chains

Availability isn’t a single KPI – it’s a system-level outcome shaped by architecture, hardware reliability, and the logical design of services. To move beyond vague “nodal” uptime percentages, operators need a structured approach that calculates true end-to-end service availability, grounded in platform MTBFs, failure domain modelling and redundancy configurations.

2.1 Understand the Predicted Availability Equation

The definition of availability above relates to actual recorded availability. However, when designing the network to accommodate the promises made to customers, we need to calculate the predicted availability first and engineer the network’s resilience measures around that.

Predicted Availability is classically defined as:

Availability = MTBF / (MTBF + MTTR)

  • MTBF (Mean Time Between Failures) represents the expected operational lifespan between service-impacting failures

  • MTTR (Mean Time to Repair) is the average time it takes to restore service after a failure is detected

When availability needs to be expressed as a percentage over time, you can use:

Availability (%) = (Total Time – Downtime) / Total Time × 100

These calculations must be repeated across each domain of the service path to calculate the full system’s availability
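
A minimal sketch of the predicted availability equation (the MTBF and MTTR figures below are assumed purely for illustration, not vendor data):

```python
# Predicted availability from MTBF and MTTR (illustrative sketch).
def predicted_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic predicted availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. a platform with a 50,000-hour MTBF and a 4-hour MTTR
print(f"{predicted_availability(50_000, 4):.4%}")  # ~99.9920%
```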

.

2.2 Serial vs Parallel Availability Logic

In complex networks, availability must be computed across a chain of components

  • Serial logic applies when components have no failover – failure of any one results in service loss

    • Formula: A_total = A1 × A2 × A3 … An

  • Parallel logic applies when redundant systems are deployed

    • Formula: A_total = 1 – [(1 – A1) × (1 – A2) × … × (1 – An)]

This parallel logic reflects high-availability designs like:

  • Dual BNGs (Dual-homed or redundant BNG [Broadband Network Gateways, a critical component in FTTx networks] designs allow failover and parallel path availability)

  • Fibre rings (A key architectural pattern in telco transport and access networks, designed to provide path-level redundancy)

  • Redundant path routing

Mixing serial and parallel models allows operators to mirror actual service design, especially across heterogeneous networks

2.3 Topology-Informed Service Paths

Availability must be calculated not just per platform, but per end-to-end service chain

For example:

  • A broadband service might traverse: ONT – OLT – Aggregation Router – BNG – Core Router – Internet Exchange

  • Each platform’s MTBF and MTTR is calculated

  • Redundancy (if present) is factored into any parallel sections

To do this effectively:

  • Use topology-aware inventory systems (e.g. CMDBs) or network design patterns to map service paths

  • Associate MTBF values to each logical and physical node

  • Adjust for failover logic where it exists

This transforms assurance from a set of abstract KPIs into a model that mirrors real-world behaviour
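
Putting 2.2 and 2.3 together, the sketch below chains per-platform availabilities along the example broadband path, with a redundant (parallel) BNG pair. All availability figures are assumptions for illustration only:

```python
# End-to-end availability across a serial chain with one parallel (redundant) stage.
from math import prod

def serial(availabilities):
    """Components in series: the service is up only if every component is up."""
    return prod(availabilities)

def parallel(availabilities):
    """Redundant components: the service is up if at least one component is up."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical per-platform availabilities along ONT - OLT - Agg - BNG - Core - IX
ont, olt, agg, bng, core, ix = 0.9995, 0.9999, 0.99995, 0.9998, 0.99999, 0.9999

end_to_end = serial([ont, olt, agg, parallel([bng, bng]), core, ix])
print(f"Predicted end-to-end availability: {end_to_end:.4%}")
```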

.

2.4 Account for Technology Diversity

Different access technologies have different baseline availability characteristics due to their physical mediums, power dependencies and resilience features

  • FTTP/GPON – typically high availability, especially with OLT redundancy and fibre loops

  • DSL/FTTN – lower MTBF due to copper degradation and environmental factors

  • HFC – vulnerable to localised noise and shared media congestion

  • Fixed Wireless – variable based on RF conditions, tower availability and power supply

  • Satellite – more affected by uncontrollable events like weather and gateway dependency

Model availability per technology domain and avoid averaging across fundamentally different platforms.

.

2.5 Tie Availability to Fault Data

Availability estimates are more credible when validated by real-world fault records, such as:

  • Number of SEV1 and SEV2 tickets per platform

  • Duration of faults from open to close

  • Customer-impact footprint per incident

This allows operators to move beyond theoretical MTBF values and adjust models based on empirical field performance.

.

2.6 Availability vs Customer Impact

Uptime metrics can mask the true impact of a service failure. For example:

  • A 10-minute outage on a low-use DSLAM impacts fewer users than a 2-minute core router failure that drops traffic for 200,000 customers

To address this, track:

  • Customer Impact Hours (CIH) = # affected users × duration of outage

  • Availability weighted by customer impact scope, not just time

This aligns technical calculations with business and experience-level outcomes.

.

2.7 Decide on Measurement Windows

Standardising your availability calculation window ensures comparability and regulatory alignment. Common intervals include:

  • 15-minute granularity (for performance correlation)

  • Daily or monthly summaries (for SLA tracking)

  • Annualised availability (for public or strategic reporting)

Use consistent rules:

  • Only count downtime after fault clock start

  • Exclude periods based on valid attribution rules (planned maintenance, uncontrollable events)

.

Step 3: Build Your Attribution Model – What Counts, What Doesn’t

Calculating service availability isn’t just about measuring when something goes down – it’s about defining whether that downtime should count against your performance metrics. The fairness and credibility of any assurance model hinges on a transparent, repeatable attribution framework that answers a deceptively simple question:

Was the operator at fault – or not?

A well-defined attribution model draws boundaries between:

  • Customer-induced issues

  • Planned work

  • Force majeure

  • True operator-side failures

This step shows how to apply those rules consistently and embed them into your operational processes.

3.1 Defining When the Fault Clock Starts and Stops

A “fault clock start” is the point in time when an operator begins officially counting the duration of a service-affecting incident for the purpose of:

  • SLA enforcement

  • Regulatory compliance

  • Availability or downtime reporting

  • Root cause analysis and operational accountability

It’s a crucial concept in service assurance, because how and when you start counting an outage directly affects reported metrics like:

  • Uptime percentage

  • Mean Time to Repair (MTTR)

  • Customer Impact Hours (CIH)

  • SLA breach determination

Let’s say a customer experiences a service issue, but:

  • They don’t report it for several hours

  • The NOC sees an alarm but isn’t sure it’s service-affecting

  • The issue clears before anyone logs a ticket

When do you start counting the outage?
That’s where each operator’s fault clock start policy comes in.

There is no global standard – operators typically define this based on service class, regulatory pressure, and operational policy.

Here are a few common approaches:

1. Customer-Reported Trigger

  • The clock starts only when the customer raises a ticket

  • Often used for best-effort services like residential broadband

  • Benefits: avoids over-counting outages customers never noticed

  • Risks: underestimates actual availability impact

2. System-Detected Trigger

  • The clock starts when monitoring systems or alarms detect a service-affecting fault

  • Often used for managed or enterprise services

  • Requires solid telemetry, alarm correlation, and fault isolation

  • Closer to real-time visibility

3. Hybrid Model

  • The clock starts at the earlier of ticket open time or alarm detection (see the sketch at the end of this sub-section)

  • May also include grace periods or validation workflows

  • Used for premium services with customer-facing SLAs

.

Likewise, the fault clock stop is triggered when:

  • The service is fully restored (confirmed by telemetry or ticket closure)

  • A workaround is implemented that restores service to acceptable levels

  • The customer confirms resolution (in some SLA-driven models)

.

When defining your fault clock policy, consider:

  • Service Class
    Premium or SLA-backed services may demand earlier fault clock start times

  • Detection Capabilities
    If you have poor telemetry or no CPE visibility, you may need to rely on tickets

  • CIOs and Exclusions
    If the issue is caused by a Customer-Induced Outage (CIO), do you:

    • Start the clock and then stop it upon RCA?

    • Never start it because the root cause absolves the operator?

  • Dispute Management
    Clear, documented fault clock rules help reduce billing and SLA disputes
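
The sketch below combines the hybrid fault clock start (earlier of alarm or ticket), the stop at restoration, and a CIO exclusion, using hypothetical timestamps and field names:

```python
# Hybrid fault clock: start at the earlier of alarm detection or ticket open,
# stop at restoration; customer-induced outages (CIOs) are excluded (illustrative).
from datetime import datetime, timedelta
from typing import Optional

def fault_clock_duration(alarm_time: Optional[datetime],
                         ticket_time: Optional[datetime],
                         restore_time: datetime,
                         customer_induced: bool = False) -> timedelta:
    if customer_induced:
        return timedelta(0)          # root cause on customer side: excluded
    starts = [t for t in (alarm_time, ticket_time) if t is not None]
    if not starts:
        return timedelta(0)          # never detected or reported
    return restore_time - min(starts)

# Hypothetical incident: alarm at 02:10, ticket at 03:05, restored at 04:00
print(fault_clock_duration(datetime(2024, 5, 1, 2, 10),
                           datetime(2024, 5, 1, 3, 5),
                           datetime(2024, 5, 1, 4, 0)))   # 1:50:00
```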

.

3.2 Exclude Valid, Justified Downtime

Many industry frameworks agree that certain events should be excluded from downtime calculations – if specific criteria are met

Common exclusion types:

  • Planned Maintenance

    • Downtime is pre-notified to the customer (e.g. 48 hours in advance)

    • Falls within agreed maintenance windows

    • No SLA penalties apply if rollback plans exist and thresholds are not exceeded

  • Force Majeure Events

    • Natural disasters, civil unrest, widespread power failures, or upstream dependencies

    • Must be clearly logged, timestamped and justified

  • Customer-Induced Outages (CIOs)

    • ONT powered off, CPE misconfiguration, customer maintenance

    • Must be provable using TR-069, SNMP, logs or telemetry

    • RCA should clearly show that the root cause originated on the customer side

3.3 Be Consistent With Attribution Logic

Fairness is built on consistency. Create a policy that clearly outlines:

  • Which types of faults start the fault clock

  • How attribution is assigned (customer, operator, force majeure)

  • When downtime is excluded from SLA calculations

  • Who is responsible for making and reviewing these decisions

Best practice – use tagging or classification in your ticketing platform (e.g. “CIO”, “Planned”, “Regulatory Excluded”, “Operator Fault”) and tie these tags to reporting logic
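
As a sketch of tying those tags to reporting logic, the snippet below counts only operator-attributed downtime against the SLA while still recording the raw total (the ticket structure and figures are hypothetical):

```python
# Attribution tags drive what counts against the SLA (illustrative sketch).
SLA_EXCLUDED_TAGS = {"CIO", "Planned", "Regulatory Excluded"}

tickets = [
    {"id": "INC-101", "tag": "Operator Fault", "downtime_hours": 2.5},
    {"id": "INC-102", "tag": "CIO",            "downtime_hours": 6.0},
    {"id": "INC-103", "tag": "Planned",        "downtime_hours": 1.0},
]

sla_downtime = sum(t["downtime_hours"] for t in tickets
                   if t["tag"] not in SLA_EXCLUDED_TAGS)
total_downtime = sum(t["downtime_hours"] for t in tickets)

print(f"SLA-counted downtime: {sla_downtime} h (of {total_downtime} h recorded)")
```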

3.4 Transparency – Use Evidence to Support Exclusions

You can’t exclude downtime just because it’s inconvenient. Exclusions require transparency with the customer, starting from the service definition / contract that they sign up to. They also require the ability to build evidence requirements into your attribution model:

  • CIO Exclusions – requires logs showing power loss, CPE reboot, customer action

  • Planned Maintenance – requires notification timestamp, affected list, rollback steps

  • Force Majeure – must cite external documentation (e.g. weather alerts, power utility notices)

This ensures transparency during disputes and supports your reputation with regulators, enterprise clients and account teams

3.5 Tier Attribution by Service Class

Not all services need the same level of rigour. Define different attribution models for:

  • Best-effort services (e.g. residential broadband)

    • Ticket-based fault clock start

    • CIOs easily excluded

    • SLAs often not contractual

  • Premium services with SLAs

    • Early fault clock start (e.g. from alarm or test failure)

    • Manual or automated RCA required

    • Exclusions must be reviewed and logged

The tighter the SLA, the more robust and auditable your attribution logic must be

3.6 Manage Ambiguity and Edge Cases

Grey zones happen – for example:

  • Power outage at a site shared between operator and customer

  • CPE fault that also triggered a line card reset

  • Simultaneous upstream and access issues

To handle these fairly:

  • Define an attribution dispute workflow

  • Use an internal assurance review board to adjudicate tough cases

  • Maintain a record of override decisions for future consistency

.

3.7 Automate Attribution Logic Where Possible

Manual attribution doesn’t scale – especially across thousands of tickets and alarms per month. Build attribution into:

  • Your fault correlation platform

  • Ticketing workflows (automated tags based on root cause)

  • Planned event calendars

  • Telemetry pipelines that detect ONT/CPE state changes

Automation not only saves time – it ensures repeatable logic for audit, reporting and SLA compliance.

.

3.8 Don’t Conflate RCA with Attribution

Remember: Root Cause Analysis (RCA) and attribution are related – but they’re not the same.

  • RCA asks: What technically caused the issue

  • Attribution asks: Who is accountable for the downtime in SLA or availability terms

Some faults have shared causality – but in assurance models, one party must be assigned ownership for metric purposes

.

Step 4: Choose Standards-Based KPIs That Work for Your Network

KPIs are the lifeblood of any assurance model – but too many operators either overcomplicate them with academic standards or oversimplify them into meaningless dashboard numbers. The key is to choose KPIs that are measurable, meaningful and aligned with how services are actually delivered

This step helps you select and structure KPIs using well-established standards, while making them usable in hybrid, multi-vendor, multi-technology environments

4.1 Understand the KPI Categories that Drive Assurance

Most service assurance KPIs fall into a few functional categories:

  • Availability – is the network or service reachable and operational

  • Accessibility – can a user start a session (e.g. login, attach, establish call)

  • Retainability – once started, does the service persist successfully

  • Integrity – does the service maintain expected performance levels (e.g. speed, latency)

  • Mobility – for mobile services, are handovers seamless and non-disruptive

These categories can apply across technologies – fixed, mobile, hybrid, business, consumer

4.2 Use Standards to Define KPI Logic

Several global bodies define KPI standards. Rather than reinvent the wheel, map your assurance model to the relevant subset of these:

  • ITU-T Y.1564 – frame loss, delay, jitter, throughput, availability (Ethernet service activation)

  • ITU-T Y.1731 – OAM for Ethernet fault and performance monitoring

  • ETSI TS 132.451 – KPI definitions for LTE/UMTS networks (accessibility, retainability, integrity)

  • GSMA IR.42 – QoS parameters for mobile networks (voice, SMS, data)

  • GSMA IR.81 – roaming quality and monitoring models

These provide both definitions and formulas – e.g. FLR = Frames Lost / Frames Sent

Use these as your baseline, then adapt for local context or access-specific requirements

4.3 Align KPI Types to Technology Domains

No single KPI set works for every technology. Define KPI bundles for each service type:

  • GPON and FTTP

    • Throughput (down/up), session success, ONT uptime

    • OLT availability, ONU registration success

    • Delay/jitter if used for voice/IPTV

  • DSL/FTTN

    • Sync rate vs expected speed, session success, retrain counts

    • DSLAM port uptime, copper fault frequency

  • Mobile (LTE/5G)

    • Attach success, handover success, call drop rate

    • Average latency, cell availability, throughput per UE

    • RRC connection setup time

  • Roaming

    • IR.81: CSSR, PDP success, data session retainability, mean throughput

    • Delay in DNS resolution, Diameter response times

Match KPIs to the service class and technology limitations, rather than applying generic metrics across the board

4.4 Standardise Definitions Across Vendors

Multi-vendor networks often suffer from inconsistent KPI logic. For example:

  • Vendor A defines “session drop” based on RADIUS timeout

  • Vendor B defines it based on DSL retrain

  • Vendor C doesn’t track it at all

To ensure consistency:

  • Build a common KPI dictionary

  • Enforce KPI mapping at the assurance platform or data collector level

  • Normalise vendor-specific counters into cross-domain metrics

This avoids “apples-to-oranges” comparisons and enables reliable benchmarking
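
A sketch of a common KPI dictionary that normalises vendor-specific counters into one cross-domain metric (the vendor names and counter keys are hypothetical):

```python
# Normalise vendor-specific counters into a common KPI (illustrative sketch).
KPI_DICTIONARY = {
    "session_drop_rate": {
        "vendor_a": lambda c: c["radius_timeouts"] / c["sessions"],  # RADIUS-based
        "vendor_b": lambda c: c["dsl_retrains"] / c["sessions"],     # retrain-based
        "vendor_c": lambda c: None,  # counter not exposed: report a gap, don't guess
    },
}

def normalised_kpi(kpi: str, vendor: str, counters: dict):
    return KPI_DICTIONARY[kpi][vendor](counters)

print(normalised_kpi("session_drop_rate", "vendor_a",
                     {"radius_timeouts": 12, "sessions": 10_000}))   # 0.0012
```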

4.5 Use Measurement Windows that Match Operational Needs

Choose KPI reporting intervals that align with both:

  • The fault and performance detection window

  • The SLA or regulatory reporting cadence

Common windows:

  • 5 or 15 minutes – for correlation with fault or usage spikes

  • Hourly/Daily – for trend and incident detection

  • Monthly/Quarterly – for SLA reviews or customer reports

Ensure that:

  • KPI snapshots match your incident timelines

  • Measurement boundaries (e.g. midnight cut-offs) are clearly defined

4.6 Include Active and Passive KPI Collection

Don’t rely solely on what devices report. Build a combination of:

  • Passive KPIs

    • Logs, SNMP counters, flow data, alarm events

    • No customer impact, but limited by vendor instrumentation

  • Active KPIs

    • Test probes (e.g. HTTP pings, DNS, SIP), synthetic traffic, real-time checks

    • Useful for visibility in the absence of faults or tickets

Hybrid approaches ensure visibility even in silent failures, and support proactive detection
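
As an illustration of lightweight active probes (standard library only; the target hostname and URL are placeholders):

```python
# Minimal active probes: DNS resolution time and HTTP reachability (illustrative).
import socket
import time
import urllib.error
import urllib.request

def dns_probe(hostname: str):
    """Return DNS resolution time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror:
        return None
    return (time.monotonic() - start) * 1000

def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500
    except OSError:
        return False

print(dns_probe("example.com"), http_probe("https://example.com"))
```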

4.7 Build KPI Relevance into the Reporting Layer

A common mistake is surfacing too many KPIs at once. Prioritise relevance based on:

  • The service class (enterprise vs residential)

  • The target audience (NOC, execs, customers)

  • The decision needed (is this breach-worthy, trend-worthy, or just noise)

Structure reports to focus on violations, anomalies and impact – not just static graphs

.

Step 5: Integrate Assurance into OSS, BSS and SLA Workflows

Service assurance doesn’t deliver value in isolation. To be meaningful, it must be embedded across the systems that operate, monitor, ticket, and monetise network services. This step focuses on how to integrate your assurance framework into the core platforms that drive operational continuity and customer experience.

When done right, metrics don’t just live on dashboards – they power alarm suppression, SLA compliance tracking, capacity planning, churn reduction and closed-loop automation.

5.1 Connect Assurance to Network Inventory Systems for Topology Awareness

Assurance only works if you know which infrastructure supports each service. This is where Service Impact Analysis (SIA) engines tied to network inventory solutions are so important in your OSS stack.

  • Use a dynamic inventory system to link:

    • Access nodes to subscribers

    • VLANs to service tiers

    • Logical paths to physical ports or links

    • etc
  • Update inventory dynamically from:

    • Network discovery tools

    • Provisioning workflows

    • Orchestration platforms

    • Other platforms involved in asset lifecycle management (eg network designs, procurement, warehouse / spares, workforce management systems, etc)

This ensures that availability, performance and fault impact are correctly mapped to customer-facing services.

.

5.2 Fault / Alarm Management Systems

Assurance platforms should receive, correlate and act on:

  • SNMP traps and syslogs from network devices

  • Telemetry from access gear (e.g. ONTs, CPE, DSLAMs)

  • Alarms from transport and core networks

Key capabilities to enable:

  • Fault correlation – group related alarms across domains or layers

  • Root cause identification – isolate the triggering event from symptom cascades

  • Service impact tagging – flag which customers or services are affected

This enables faster triage, smarter ticketing and reduced NOC noise.

.

5.3 Drive Automation in Ticketing Workflows

The connection between assurance and ticketing is crucial – this is where theory meets operational reality

  • Automatically raise tickets when:

    • KPI thresholds are breached

    • Fault clocks start based on event detection

    • Active tests fail for critical services

  • Pre-fill tickets with:

    • Service ID

    • Platform and fault domain

    • Customer impact scope

    • Attribution tag (CIO, Planned, Operator)

  • Link resolution data (from field teams, NOC, auto-remediation tools) back to:

    • Fault closure times

    • MTTR calculation

    • Customer Impact Hours (CIH)

This builds a complete chain from detection to resolution, supporting both SLA tracking and RCA.

.

5.4 Plug Into SLA Engines and Contract Enforcement

SLA logic should consume assurance inputs to:

  • Track real-time SLA compliance by customer or service class

  • Generate violation reports automatically

  • Trigger escalations or credits based on breach logic

Examples:

  • If fault clock > 4 hours for Gold service – open escalation (see the sketch below)

  • For regulated services – auto-compile compliance report to meet disclosure schedules
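
A sketch of the Gold-service escalation rule above (the service classes and thresholds are illustrative, not contractual values):

```python
# Escalate when a premium service's fault clock exceeds its SLA threshold (sketch).
SLA_THRESHOLD_HOURS = {"Gold": 4, "Silver": 8, "Bronze": 24}

def needs_escalation(service_class: str, fault_clock_hours: float) -> bool:
    threshold = SLA_THRESHOLD_HOURS.get(service_class)
    return threshold is not None and fault_clock_hours > threshold

if needs_escalation("Gold", fault_clock_hours=4.5):
    print("Open escalation and notify the account team")
```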

.

5.5 Feed Executive and Operational Dashboards

Different roles need different views of assurance data. Tailor reports by persona:

  • NOC / Ops Teams

    • Live faults, open tickets, service-level impact

    • MTTR trends and platform reliability

  • Product / Service Managers

    • SLA conformance by service class

    • Repeat offenders by region or technology

  • Executives

    • Overall service health posture
    • Uptime by market segment

    • Improvement metrics over time

.

5.6 Enable Closed-Loop and AI-Driven Responses

As networks modernise and AIOps solutions become progressively more advanced, assurance data can trigger:

  • AI/ML systems for anomaly detection, predictive degradation and RCA suggestions
  • Improvements across all MTTx metrics
  • Auto-remediation scripts / automations (eg interface flaps, route reroutes, session resets)
  • Escalation workflows with prioritisation based on CIH or SLA risk

This moves assurance from observational to intent-driven, reducing manual overhead and improving resilience.

.

5.7 Align Assurance with Care and CRM Systems

Assurance data shouldn’t stop at the network edge (inward-facing perspectives); it should also assist with customer-facing actions, including:

  • Contact centre tools – to show which customers are impacted right now

  • CRM systems – to log availability breaches or trends on a per customer basis or other cohorts

  • Account management portals – to provide self-service SLA reports

This closes the loop between infrastructure, service experience and customer perception

.

Step 6: Metrics that Matter – Making Metrics Transparent and Customer-Facing

For many operators, service assurance is a black box – only a handful of engineers understand how metrics are defined, what counts as downtime or how SLAs are measured. But for customers and stakeholders to trust your metrics, you need transparency by design and contract.

This means exposing not just the outcomes, but the rules, exclusions and reasoning behind them – in language customers and account teams can understand. When metrics are visible and defensible, they become a competitive differentiator, not just a compliance or contractual requirement.

The following are some hints to strengthen your Service / Network Assurance Framework as a differentiator.

6.1 Distinguish Internal Metrics from Customer-Facing Metrics

Operators track many internal metrics that may not make sense to end users. Clarify which KPIs are:

  • For engineering use only (e.g. port-level uptime, internal path MTBF)

  • For operational use (e.g. SEV trends, MTTR, alarm counts)

  • For customer-facing communication (e.g. service availability %, CIH, SLA breaches)

Make sure internal and external views are mapped but not conflated, to avoid disputes and misunderstandings.

.

6.2 Create a KPI Glossary That Customers Can Actually Read

Every customer report should include a definitions section that:

  • Explains each metric in plain language

  • Clarifies measurement windows (eg 24/7 monthly vs business hours)

  • Shows how exclusions (eg planned, CIO, force majeure) are applied

  • Provides example calculations where appropriate

This reduces confusion and builds confidence in the process. It also reduces the effort expended by your team on disputes.

.

6.3 Clearly Document and Share Attribution Logic

As described earlier, customers should never be surprised by how availability was calculated. Provide a transparent high-level summary of:

  • What events are counted vs excluded

  • How CIOs are identified

  • When the fault clock starts and stops

  • What supporting evidence (tickets, telemetry) is required

You don’t have to expose every line of code – but you should explain the logic and fairness model behind your attribution.

.

6.4 Offer Transparent SLA Reporting Dashboards

Give customers access to:

  • Availability stats over time (by service, region, etc.)

  • Active and resolved incidents

  • SLA compliance summaries

  • Downtime attribution (eg operator, planned, customer)

Dashboards should be designed to:

  • Answer common customer questions

  • Show patterns (not just single events)

  • Be exportable for auditing or internal reviews

Where possible, include self-service tools so customers can:

  • Query their own metrics

  • Raise SLA challenges with context

  • Download usage and downtime logs

.

6.5 Prepare for Disputes With Evidence-Backed Reporting

Even with great transparency, disputes will happen. Build reporting that:

  • Links SLA violations to ticket IDs, timestamps and impact data

  • Stores RCA results and attribution tags

  • Flags cases where SLA logic was overridden by exception process

This allows you to defend SLA metrics in:

  • Contractual SLA reviews

  • Regulator inquiries

  • Account management meetings

.

6.6 Use Transparency as a Commercial Differentiator

In highly competitive markets, transparency can be a powerful value-add. Operators that provide:

  • Real-time service visibility

  • Explainable downtime

  • Evidence-backed reporting

…are more likely to retain enterprise customers, avoid SLA disputes and build long-term trust.

This is especially true for:

  • Government contracts

  • Managed service agreements

  • Regulated wholesale environments

Transparency isn’t just operational – it provides competitive / commercial leverage.

6.7 The Use of Active and/or Passive Collection

Service assurance depends on the quality, completeness and credibility of the data collected. That data typically comes from two main sources: passive monitoring and active testing. Each approach offers unique strengths and limitations, and leading operators use both in tandem to gain full visibility into network and service performance.

Passive monitoring refers to the collection of performance and fault data that’s generated during normal network operations. As a result, it tends to be more complete. This includes SNMP counters, syslogs, alarm traps, flow records, and telemetry collected through protocols like TR-069 or TR-369.

These methods are efficient and non-intrusive, making them ideal for establishing baseline network health, monitoring device-level reliability, and supporting long-term trend analysis. However, passive methods rely entirely on what the infrastructure is capable of reporting.

If a CPE, ONT or intermediate device has limited instrumentation, or if telemetry is disabled or fails, visibility can be compromised. Passive metrics are also less effective in detecting silent degradations – instances where service quality drops below expected levels without triggering a fault or customer complaint.

In contrast, active monitoring involves intentionally generating traffic or synthetic tests to simulate customer usage and measure real-time service quality of adjacent services.

Because additional infrastructure is required, these active probes are often only set up with one probe per network segment (eg agg ring, POI, etc). This means they don’t provide as much coverage and only show an indicative service experience, not the experience of any actual customer.

The active tests can include ICMP pings, HTTP GET requests, DNS resolution checks, or synthetic voice and video tests. More sophisticated examples include ITU-T Y.1731-based tests for Ethernet services, which assess frame delay, jitter, loss and end-to-end availability using standardised OAM mechanisms.

Active methods are powerful because they allow operators to see what the customer sees (albeit an indicative customer), even if no alarms have been raised and no faults have been logged. They’re especially valuable in multi-domain or multi-vendor environments, where passive correlation is limited, or in scenarios where services traverse unmanaged segments like internet exchanges or OTT partner networks.

The best service assurance strategies combine both approaches. Passive methods are ideal for wide-scale observability across the entire network and for building historical reliability baselines. Active testing, on the other hand, is essential for detecting intermittent issues, validating SLA commitments, and confirming service restoration in real time. For example, an operator might use TR-069 to pull performance telemetry from managed CPE devices, while simultaneously using active HTTP and ping probes to validate reachability of key services. On an Ethernet backhaul circuit, Y.1731 might run continuously in the background, while the assurance platform cross-checks results against SNMP counters and alarm history to confirm integrity.

Integrating active and passive inputs also improves the accuracy of fault attribution. For instance, if passive monitoring shows a healthy link but synthetic DNS probes consistently fail, the assurance platform may trigger a fault clock start even before a customer raises a ticket. Conversely, active success alongside passive alarms may help confirm a false positive or justify SLA exclusions. In either case, dual-layer visibility ensures that the assurance system reflects true service experience, not just device behaviour.

As networks become more dynamic, disaggregated and customer-centric, with more advanced blue/green release strategies, the case for active testing grows stronger. Used wisely, synthetic tests don’t just confirm uptime – they prove availability in a way that customers and regulators can trust.

.

Step 7: Drive Continuous Improvement and Year-on-Year Maturity

A good assurance model tells you what happened
A great assurance model helps you get better

Whilst five-nines might be a commonly cited target, the reality is that every operator’s environment is different, so there’s no global target. Instead, whatever metrics you’ve gathered in the past, your true success in network/service assurance is continued improvement.

However, this can be asymptotic (diminishing returns) in nature. It’s easier to improve when coming from a low initial resilience level than when you’ve already had years of refined improvements. Yet continual improvement is still the target – not by gaming the metrics (eg changing attribution logic / exclusions), but through genuine, ongoing improvement activities.

This final step is about using the metrics, attribution logic and platform integrations you’ve already built to continuously refine operations, improve service reliability and reduce customer impact over time. True assurance maturity isn’t measured by uptime – it’s measured by how quickly you adapt, recover and improve.

In many cases, it’s about getting more granular or more specific with what you measure and manage.

Raw availability percentages (e.g. 99.95%) are useful, but they don’t show how you’re evolving. To give just a small sample of examples, start tracking:

  • Customer Impact Hours (CIH)

    • Number of affected customers × duration of fault

    • Weighted availability scores that reflect customer scope

    • CIH by domain, service type, etc
  • Recurring fault domains

    • Repeat outages on the same DSLAM, ring, OLT, node, device type, batch of equipment, etc

    • Service types or regions with higher MTTR

  • Mean Time to Detect (MTTD)

    • Time between fault occurrence and detection

    • Critical for identifying monitoring gaps

  • Mean Time to Acknowledge (MTTA)

    • Time from detection to ticket creation or NOC response

    • Reveals operational bottlenecks

  • Mean Time to Resolve (MTTR)

    • Time from acknowledgement to resolution

    • Benchmarked across platforms and teams

These metrics go beyond “was it up” and show how fast you react and recover. They may require more advanced capabilities or faster response mechanisms from your assurance / AIOps / observability solutions.
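
A sketch of deriving the MTTx values from incident lifecycle timestamps (the record structure and figures are hypothetical):

```python
# MTTD / MTTA / MTTR from incident lifecycle timestamps (illustrative sketch).
from datetime import datetime
from statistics import mean

incidents = [
    {"occurred":     datetime(2024, 6, 1, 2, 0),
     "detected":     datetime(2024, 6, 1, 2, 6),
     "acknowledged": datetime(2024, 6, 1, 2, 15),
     "resolved":     datetime(2024, 6, 1, 3, 30)},
]

def mean_minutes(start_key: str, end_key: str) -> float:
    return mean((i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents)

print("MTTD:", mean_minutes("occurred", "detected"), "min")        # 6.0 min
print("MTTA:", mean_minutes("detected", "acknowledged"), "min")    # 9.0 min
print("MTTR:", mean_minutes("acknowledged", "resolved"), "min")    # 75.0 min
```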

Once you have good observability and historical trends, layer in:

  • Anomaly detection based on past failure patterns

  • Predictive fault modelling using telemetry and ML

  • Automated remediation workflows for known issue signatures

  • Intent-based monitoring – e.g. “Keep latency under 10 ms on Tier 1 paths”

This shifts assurance from “measure the past” to “protect the future”

.

Summary

This guide presented a seven-step framework that enables operators to build a complete, standards-aligned, and customer-facing network and service assurance model. Starting with a clear distinction between network and service assurance, the framework walks through availability modelling using MTBFs and redundancy logic, defines consistent and defensible fault attribution rules, and aligns KPI selection with ITU, ETSI and GSMA standards across technologies like GPON, DSL and LTE.

The framework goes further by embedding assurance into OSS and BSS workflows, delivering transparent SLA reporting to customers, and driving operational maturity through metrics like Customer Impact Hours, MTTR and RCA analysis. By transitioning from reactive monitoring to predictive, closed-loop assurance, operators can reduce customer impact, build internal accountability and use service assurance as both a performance management tool and a strategic business enabler.

Still to be Added:

  1. Diagram of nodal vs E2E 
  2. Diagram of active vs passive monitoring
  3. Detailed description of monitoring technologies like TR-069, etc
  4. Other metrics that matter in relation to managing SA/NA in a NOC
  5. SLA measurements, rebates, etc
  6. Sample service contract snippets that provide clarity around availability to set customer expectations
  7. Specific considerations for availability in different types of networks (eg GPON, HFC, satellite, DSL, etc)
  8. List of KPIs and acronyms table
  9. Deeper-dive into the standards docs
  10. Section about Knowledge Management and its relationship to NA/SA
  11. Insert description around diagram with the various MTTx metrics defined (below)

.

Acknowledgements:

Hat tip to Paul, Ray, Stevo, Deepak, Carolyn, Evan, Clayton and Tony for your inputs / insights into this article!

.

Next Steps

As you know already, each network and solution environment is unique, so it follows that each Network / Service Assurance challenge has its own unique drivers. They’re all complex, they’re all different. Hopefully the step-by-step guide shown above has helped you map out your Network / Service Assurance Framework.

If you’re seeking further assistance, we’d be delighted to help.

To discuss ways you can optimise the effectiveness of your next OSS/BSS transformation, start with Step A and Request an Appointment.