Avoiding telco black swans: Insights from the airline industry

For many of us, telco network downtime is more of a nuisance than a critical event. But for some mission-critical and life-critical use cases (like e-Health scenarios), telecommunications networks are certainly not just nice-to-have. They have to be available every second of every day of every year.

It has recently dawned on me that a large proportion of Sev1s (severity 1 incidents that cause significant customer and network impact) and near-misses are considered “black swan” events because the telco has never seen or contemplated them before. Most telco networks are already incredibly resilient, with a lot of thought given to mitigating potential failure scenarios. They also tend to have multiple layers of resiliency built in (to achieve five-nines, i.e. 99.999% availability, or higher). Therefore, in many cases, Sev1s are actually caused by multiple simultaneous failures (often obscure ones).
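To put the five-nines target in perspective, here's a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
def annual_downtime_minutes(availability_percent: float,
                            days_per_year: float = 365.0) -> float:
    """Return the downtime budget, in minutes per year, for a given availability.

    Assumes a 365-day year by default (an illustrative simplification).
    """
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * days_per_year * 24 * 60


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to a little over five minutes per year, which is why a single Sev1 can consume it many times over.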

By “multiple failures,” I don’t mean failures that cascade from a single issue, but multiple independent issues arising at once (e.g. a fibre cut on the working path, combined with a card failure or software failure that prevents customer traffic from failing over to the resilient / protect path according to the resiliency plan). They’re often multi-domain in nature too, such as an unusual combination of a transmission failure, an outside plant failure, a routing problem, a power failure, a weather event, an accident, a planned change, etc.
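The fibre-cut-plus-failed-failover example can be sketched as simple probability arithmetic. This is a rough model, assuming the failures are statistically independent and using made-up availability figures; real networks have correlated failure modes that it ignores. The function and parameter names are hypothetical.

```python
def protected_service_availability(working_path: float,
                                   protect_path: float,
                                   failover_success: float) -> float:
    """Availability of a service with one working path and one protect path.

    Assumes independent failures (an illustrative simplification).
    The service is up if the working path is up, OR if the working path
    is down, the protect path is up, and the failover actually succeeds.
    """
    return working_path + (1.0 - working_path) * protect_path * failover_success


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real network.
    perfect_failover = protected_service_availability(0.999, 0.999, 1.0)
    flaky_failover = protected_service_availability(0.999, 0.999, 0.99)
    print(f"Failover always works:      {perfect_failover:.8f}")
    print(f"Failover works 99% of time: {flaky_failover:.8f}")
```

Under these assumptions, even a small chance of the failover mechanism misbehaving takes a disproportionate bite out of the overall availability, which is consistent with Sev1s tending to involve a second, obscure failure on top of a routine one.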

However, it also seems likely that the same black swan (or something quite similar) has already been experienced by other carriers around the world, which means it’s not truly a black swan event. Someone in the world has probably already encountered the scenario and/or figured out how to fix it or circumvent it.

During post-incident reviews (PIRs) of these Sev1 events, the telcos and/or their vendors take the time to forensically analyse the incident and capture its “fingerprint” to ensure it doesn’t happen again. However, as far as I can tell, telcos and vendors don’t have a mechanism to systematically share fingerprints with others in the industry. Some vendors might share knowledge amongst their other customers, but in most cases I suspect it happens on a person-by-person basis rather than through a systematic sharing of knowledge.

With an increasing use of AI/ML for pattern / trend analysis to identify and/or codify anomalies, not to mention chaos engineering techniques, we’re going down a path of more systematic “fingerprinting.” Perhaps this makes it easier for vendors (or the industry more broadly) to consider how to share their outage or near-miss knowledge.
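As a thought experiment, here's a minimal sketch of what a shareable incident fingerprint might look like. Everything in it is an assumption made for illustration: the (domain, fault type) labels, the function names, and the use of SHA-256 and Jaccard similarity are not drawn from any existing industry scheme.

```python
import hashlib
from typing import FrozenSet, Iterable, Tuple

Fault = Tuple[str, str]  # (domain, fault type); a hypothetical taxonomy


def normalise(faults: Iterable[Fault]) -> FrozenSet[Fault]:
    """Lower-case and trim labels so that ordering and casing don't matter."""
    return frozenset((d.strip().lower(), t.strip().lower()) for d, t in faults)


def fingerprint(faults: Iterable[Fault]) -> str:
    """Hash the sorted set of contributing faults into a stable identifier."""
    canonical = "|".join(f"{d}:{t}" for d, t in sorted(normalise(faults)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def similarity(a: Iterable[Fault], b: Iterable[Fault]) -> float:
    """Jaccard similarity between two incidents' fault sets (0.0 to 1.0)."""
    set_a, set_b = normalise(a), normalise(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Illustrative incidents only, not real outage records.
    ours = [("transmission", "fibre cut"), ("transport", "protection switch failure")]
    theirs = [("Transport", "Protection switch failure"), ("Transmission", "Fibre cut")]
    near_miss = [("transmission", "fibre cut"), ("power", "rectifier failure")]

    print("Exact match:", fingerprint(ours) == fingerprint(theirs))
    print("Near-miss similarity:", round(similarity(ours, near_miss), 2))
```

An exact hash match would flag a scenario another carrier has already seen, while the similarity score is one way to surface the “something quite similar” cases. The hard part in practice would be agreeing on a shared fault taxonomy, which this sketch simply takes for granted.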

By comparison, I use the analogy of how knowledge about catastrophic incidents in the airline industry is widely shared by manufacturers (e.g. Boeing, Airbus, etc) with all of their airline customers. Or the lengths that car-builders go to on recalls, even though it’s often parts provided by third-party suppliers that have the systemic problems needing resolution.

However, perhaps I’m just naïve about this situation. I know it’s not an easy problem to solve, but perhaps I’m oblivious to initiatives that are already underway. Do you know of any information sharing that already happens within vendors, across vendors, across the industry, by standards bodies, etc with regard to black swan events?

PS. If you’re a network vendor, OSS vendor, etc, I’d love to hear from you if you already offer this type of service to your customers.

