There was a brilliant article from Matt Kapko over at SDxCentral last week entitled “AWS Outage Stresses Telco Cloud Challenges.” It specifically highlighted lengthy outages on AWS in December and the downstream impact cloud outages can have for telcos that depend on third-party cloud providers.
“The benefits of public cloud are clear — efficiency, scalability, and the ability to consolidate functions with less equipment. For telco operators this translates to better economics, business agility, and accelerated innovation at the pace of software,” explained Don Alusha in the article.
That’s all completely true and it makes sense for telcos to leverage public cloud. There are a few considerations to take into account though:
- Despite fantastic infrastructure resiliency mechanisms, cloud providers aren’t immune from outages
- Carriers aren’t immune from outages either, so I’m not taking sides relating to infallibility in this article. [BTW. If any reader is aware of confirmed metrics that show security and availability comparisons between cloud and carrier infrastructure I’d love to see them!!]
- Carrier infrastructure “tends” to be more localised, so outages may take out services within a region or perhaps even a country. That means subscribers are impacted within the affected area
- However, with an increasing number of carriers leveraging cloud infrastructure, cloud outages are likely to lead to a more global impact area
- In the past, carriers tended to build their own active networks (core infrastructure), but 5G is changing that paradigm, as will all future network models that embrace virtualisation and cloud-native concepts
- Carriers traditionally owned and managed their own infrastructure and therefore the network was within their locus of control (i.e. carriers could prioritise what got fixed and allocate resources to optimise management of the infrastructure, around both business-as-usual and catastrophic outages)
- Leveraging cloud infrastructure means the telcos no longer have as much ability to prioritise or control the events relating to fault restoration
- And it’s not just network infrastructure that’s impacted here. When carriers have OSS and BSS in the cloud, they lose the ability to manage the network, systems and even the workforce during an outage.
If we (simplistically) think of networks as being the data/customer plane and OSS/BSS as being the control plane, then with a cloud outage we have the potential to lose both planes. In most traditional telco outages, it’s either the control OR the data plane that’s impacted, not both. Again, cloud outages will tend to have a broader impact.
This adds an interesting additional layer of complexity to our High Availability (HA) planning, doesn’t it? We previously generated HA designs for our network and HA designs for our OSS/BSS (more or less in isolation). But if both of these are just overlays on cloud infrastructure, then we’re delegating HA design, as well as services, to the cloud operators.
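To put some rough numbers around the "both planes at once" concern, here's a minimal sketch in Python. The unavailability figure is entirely made up for illustration (I don't have real metrics, as noted above), and the model is deliberately simplistic: it ignores faults in the network or OSS/BSS software itself and only looks at the underlying platform.

```python
# Illustrative only: the unavailability figure below is a hypothetical assumption,
# not a measured value for any cloud provider or carrier.

def both_down_independent(p_network_down: float, p_oss_down: float) -> float:
    """Chance that both planes are down at the same time when they sit on
    separate platforms that fail independently of each other."""
    return p_network_down * p_oss_down


def both_down_shared(p_cloud_down: float) -> float:
    """Chance that both planes are down at the same time when they share one
    cloud platform: a single platform outage takes out both."""
    return p_cloud_down


p = 0.001  # hypothetical: each platform is unavailable 0.1% of the time

independent = both_down_independent(p, p)
shared = both_down_shared(p)

print(f"separate platforms: {independent:.6f}")
print(f"shared platform:    {shared:.6f}")
print(f"shared is roughly {shared / independent:.0f}x more likely to lose both planes")
```

The exact numbers don't matter. The point is that once the data plane and the control plane share the same failure domain, the chance of losing both together stops being a product of two small numbers and becomes the single platform's outage probability.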
What do we do to overcome this?
Don Alusha further explains, “Operators should hedge their bets in alternative and competing cloud platforms to change the structure of current systems and processes to produce more of what is desirable and less of that which is undesirable.”
Well, that’s true. HA models tend to be built around diversity, avoiding any Single Points of Failure (SPoF). But how does that impact the design of our OSS and BSS to ensure we maintain control of our control plane? Do we design our solutions to be decoupled and stretched across different regions/zones and even cloud providers??
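One way to start reasoning about that question is to map each component to the failure domains it runs in, then ask what's lost if a whole provider disappears. The sketch below is only a thought experiment in Python: the component names, providers and regions are all hypothetical, and a real design would also need to consider shared dependencies such as DNS, identity and connectivity between clouds.

```python
# Hypothetical deployment map: component -> set of (provider, region) pairs
# it runs in. None of these names refer to real deployments.
DEPLOYMENT = {
    "5g-core": {("cloud-a", "region-1"), ("cloud-a", "region-2")},
    "oss-assurance": {("cloud-a", "region-1")},
    "bss-billing": {("cloud-a", "region-1"), ("cloud-b", "region-1")},
}


def lost_if_provider_fails(deployment, failed_provider):
    """Return the components with no instance outside the failed provider."""
    return {
        name
        for name, domains in deployment.items()
        if all(provider == failed_provider for provider, _ in domains)
    }


def provider_spofs(deployment):
    """Map each provider to the components it is a single point of failure for."""
    providers = {provider for domains in deployment.values() for provider, _ in domains}
    spofs = {}
    for provider in providers:
        lost = lost_if_provider_fails(deployment, provider)
        if lost:
            spofs[provider] = lost
    return spofs


for provider, components in sorted(provider_spofs(DEPLOYMENT).items()):
    print(f"{provider} is a SPoF for: {', '.join(sorted(components))}")
```

In this made-up example the core is spread across two regions, but both regions belong to the same provider, so a provider-wide outage still takes out the core and the assurance stack together. Only the billing component, stretched across two providers, survives.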
Does this slideshare from Kai Waehner provide a few thoughts for event-streaming platforms?
I’d love to get your thoughts on all of this as I certainly don’t have the answers to these conundrums!!
BTW. Matt’s article made me think back to autonomy being the OSS Security Elephant in the Room in this earlier article, which was in turn inspired by this article about 5G security by Bert Hubert. They might be worth a read too!