Network Resilience Engineering: The Role of Operational Support Systems (OSS)

Operational Support Systems (OSS) perform many business-critical functions for network operators, as shown in the diagram below. They are the connectors and the profit engine behind any communications network.

As indicated above, the assurance flows help to retain revenues and customers. OSS are an insurance policy for a telco’s brand, or at least the technology aspects of reputation and brand value. A major contributor to that reputation arises from a telco’s ability to offer the right services, at the right time, in the right place, with no interruptions to customer service expectations.

With our ever-increasing reliance on these networks, understanding and enhancing their resilience is crucial. A seminal paper by James P.G. Sterbenz et al., titled “Evaluation of Network Resilience, Survivability, and Disruption Tolerance” provides a comprehensive approach to evaluate network resilience through analysis, simulation and experimentation.

The authors present the ResiliNets framework for resilient, survivable and disruption-tolerant network architecture and design. The framework is based on a two-phase resilience strategy and a set of design principles emphasising heterogeneity, redundancy and diversity.

Within the context of Operational Support Systems (OSS), there are two threads:

OSS contribute to resilience of the network under management – across the entire network lifecycle:
- From resilient network planning and design improvements
- Through to AIOps and other assurance solutions
- Then on to closed-loop mechanisms with humans and/or machines in the loop (covered in more detail under the ResiliNets D2R2+DR concept below)
But OSS and related systems also need resilience engineering principles applied to themselves, so that they remain available to manage the network and customer services.

The ResiliNets strategy consists of two phases, written D2R2+DR (as shown in the diagram below). The D2R2 phase (inner ring) comprises Defending, Detecting, Remediating and Recovering: activities undertaken in real time so that a system can rapidly adapt to challenges and maintain an acceptable level of service. The DR phase (outer ring) comprises Diagnosing faults and Refining future behaviour: background operations that observe and modify the behaviour of the D2R2 cycle.

Reproduced with permission from James P.G. Sterbenz et al, from their paper titled “Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines” https://resilinets.org/papers/Sterbenz-Hutchison-Cetinkaya-Jabbar-Rohrer-Scholler-Smith-2010.pdf
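
To make the two rings of D2R2+DR more concrete, here is a minimal Python sketch. It is not taken from the ResiliNets papers: the class name `D2R2DRLoop`, its fields and the idea of "refining" by adding a challenge to a set of known threats are all assumptions made purely for illustration. In a real OSS, refinement might instead mean adding a protection path or tuning an alarm rule.

```python
from dataclasses import dataclass, field


@dataclass
class D2R2DRLoop:
    """Toy model of the ResiliNets D2R2+DR strategy (all names hypothetical)."""
    known_threats: set = field(default_factory=set)   # what Defend can already absorb
    incident_log: list = field(default_factory=list)  # evidence for the background DR phase

    def handle(self, challenge: str) -> list:
        """Inner ring (real time): Defend, Detect, Remediate, Recover."""
        if challenge in self.known_threats:
            return ["defend"]                  # absorbed by existing defences
        self.incident_log.append(challenge)    # keep evidence for later diagnosis
        return ["detect", "remediate", "recover"]

    def diagnose_and_refine(self) -> int:
        """Outer ring (background): Diagnose faults, Refine future behaviour."""
        learned = set(self.incident_log) - self.known_threats
        self.known_threats |= learned          # refine: defend directly next time
        self.incident_log.clear()
        return len(learned)


loop = D2R2DRLoop(known_threats={"port-scan"})
first = loop.handle("fibre-cut")    # not yet defended, so the full inner ring runs
loop.diagnose_and_refine()          # background phase learns from the incident
second = loop.handle("fibre-cut")   # now absorbed by the refined defences
```

The point of the sketch is only the shape of the loop: the inner ring reacts in real time, while the outer ring runs in the background and changes how the inner ring behaves on the next occurrence.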

The authors have published an extensive library of papers on the ResiliNets strategy, architecture, framework and simulations.

In one of these papers, “Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines,” the authors also categorise a fundamental set of concepts for network resilience engineering that includes 4 axioms, 6 strategies and 17 resilience principles, as captured in the diagram below.

Derived from the paper by James P.G. Sterbenz et al, titled “Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines”

The principles (right-most column above) have been further broken down into the following categories:

Prerequisites

Service requirements need to be determined to understand the level of resilience the system should provide.
Normal behaviour is the process of learning the baseline of a network’s normal operational parameters and behaviours.
Threat and challenge models are essential to understanding and detecting potential adverse events and conditions.
Metrics and targets that quantify the service requirements and operational state are needed to measure the operational health of the network, and in turn to detect and measure resilience.
Heterogeneity in mechanism, trust, and policy reflects the reality that no network comprises a single technology, and no single technology is appropriate for all scenarios, so disparate makes / models / topologies / etc abound throughout all networks. Moreover, choices change as time progresses. Therefore, resilience mechanisms must cope with variability in (and interconnections between) disparate link technologies, addressing, routing, signalling, interfaces and much more.
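
As a rough illustration of the "normal behaviour" and "metrics and targets" prerequisites, the sketch below learns a baseline from historical samples of one metric and flags values that stray too far from it. The function names, the latency figures and the three-standard-deviation threshold are assumptions for the example, not anything prescribed by the ResiliNets papers. A production assurance tool would typically also account for seasonality and trend.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Summarise normal behaviour as (mean, standard deviation) of past samples."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """Flag a value more than k standard deviations away from the baseline mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Hypothetical round-trip latency samples (ms) collected during normal operation
latency_ms = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 20.3, 19.7]
baseline = learn_baseline(latency_ms)

normal_reading = is_anomalous(20.5, baseline)   # small wobble, within the baseline
spike_reading = is_anomalous(35.0, baseline)    # well outside normal behaviour
```

Here the baseline plays the role of "normal behaviour", while the threshold `k` is the kind of target that would need to be derived from the service requirements.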

Design Tradeoffs

Resource tradeoffs determine the deployment of resilience mechanisms. The relative composition and placement of these resources must be balanced to optimise resilience and cost.
Complexity of the network results from the interaction of systems at multiple levels of hardware and software. Increased complexity makes it difficult for humans and machines to understand, monitor and manage networks, which in turn threatens resilience.
State management is an essential part of any large complex system. It is related to resilience in two ways: First, the choice of state management impacts the resilience of the network. Second, resilience mechanisms themselves require state and it is fundamentally important how they manage state.

Enablers

Self-protection and security are essential properties to defend against challenges faced in / by a network.
Connectivity and association among communicating entities should be maintained when possible based on eventual stability.
Redundancy in space, time, and information increases resilience against faults / challenges / threats if defenses are penetrated.
Diversity is closely related to redundancy, but has the key goal to avoid fate sharing. Diversity in space, time, medium, and mechanism increases resilience against localised challenges.
Multilevel resilience composition is needed to understand and manage resilience from the bottom-up from components to systems (incorporating networks, OSS and more).
Context awareness is needed whereby resilient nodes are required to monitor the network environment and detect adverse events or conditions.
Translucency is needed to control the degree of abstraction vs. the visibility between levels, particularly with the advent of increased virtualisation and abstraction occurring in networks and network controllers / orchestrators.
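
The link between redundancy and diversity in the list above can be sketched with a little arithmetic. The example below assumes a hypothetical inventory that maps each link to a shared risk group (such as a duct or trench), checks whether a primary and backup path share fate, and computes the combined availability of redundant elements. The link names, duct names and availability figures are invented for illustration, and the availability formula only holds when the redundant elements fail independently, which is exactly what fate sharing undermines.

```python
def shared_risks(path_a, path_b, risk_groups):
    """Return the risk groups (e.g. ducts) traversed by both paths."""
    risks_a = {risk_groups[link] for link in path_a}
    risks_b = {risk_groups[link] for link in path_b}
    return risks_a & risks_b


def parallel_availability(availabilities):
    """Availability of redundant elements, assuming independent failures."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)   # all elements must fail for service to fail
    return 1.0 - unavailable


# Hypothetical inventory data: which duct each link runs through
risk_groups = {"A-B": "duct-1", "B-C": "duct-2", "A-D": "duct-1", "D-C": "duct-3"}
primary = ["A-B", "B-C"]
backup = ["A-D", "D-C"]

common = shared_risks(primary, backup, risk_groups)   # both paths leave site A via duct-1
combined = parallel_availability([0.99, 0.99])        # two independent 99% paths
```

In this toy data the two paths are redundant but not diverse, because both leave site A through the same duct. The combined availability of roughly 99.99% would therefore overstate reality until the shared risk is removed.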

Behaviour needed for Resilience

Self-organising and autonomic behaviour is necessary for network resilience that is highly reactive with minimal human intervention, especially as networks become more programmable and dynamic in nature.
Adaptability to the network environment is essential for a node in a resilient network to detect, remediate and recover from challenges (not just through rules for known challenges, but more holistically, to also cope with unknown challenges).
Evolvability is needed to respond to emerging challenges and threats, as well as to changes in network architecture, protocols, applications, use-cases and demands.

These fundamental principles represent a fantastic framework for building a resilient system (of networks and systems) and for ongoing evaluation of resilience.

A variety of different tools within a network operator’s OSS stack will play a pivotal role in implementing these strategies. OSS, with their capabilities in network design, network rollout and workforce management, network configuration / management, service fulfillment and network / service assurance, are instrumental. They help by defending against network disruptions, detecting anomalies, mounting incident responses, facilitating collaboration between domain experts across complex technology estates and aiding in the rapid recovery of services. They provide the necessary tools for diagnosing faults and refining future behavior, thereby enhancing network resilience and recovery.

The seminal work of the ResiliNets team has created a comprehensive approach to improving network resilience – presenting a framework, strategy and principles as well as methods to simulate, measure and improve. The overlapping role of OSS in this context is also crucial, providing the necessary tools and capabilities to implement these strategies and principles.

If you have even a passing interest (or obligation) in ensuring your networks and systems are robust and reliable (or you build the OSS solutions responsible for that), I strongly recommend this body of work from the ResiliNets team. It will undoubtedly prove invaluable for anyone involved in network / OSS design and management, as it provides a roadmap for building and maintaining resilient networks in the face of the many day-to-day challenges that OSS attempt to manage.

June 15, 2023
Ryan
