Network Operations Ninja Academy (NONA) of the Future

We were honoured to be guests on the Zero-Touch Telecom (ZTT) show last week. The discussion mainly revolved around a highly problematic skills gap that’s likely to widen between network masters and apprentices once we introduce significant automation via tools such as AIOps and the like.

In response, Michael P. asked a great question, “With the shift towards automation, what are some innovative ways we can cultivate the next generation of network experts?”

If we look at the 12 dilemmas we face on the journey to zero-touch operations, you’ll notice that Dilemma #8 ponders a very similar question,

“If the ninjas of the future no longer have a comprehensive apprenticeship (dilemma #2) and are dealing only with outages that have never been seen before (dilemma #4), then we arguably no longer have a ninja team that can diagnose root-cause from memories of past patterns.”

Creating a “Network Operations Ninja Academy (NONA)” that prepares individuals to handle unprecedented network outages (black swan events) requires a shift from traditional network ops training methods to a multidisciplinary approach. It’s clear that such an approach would need to emphasise psychological resilience, adaptive thinking to handle unforeseeable situations and cross-functional leadership of experts.

As Geoff Hollingworth pointed out in the preamble to the ZTT show, this would seem to align with special-forces military training (not that I’ve ever been on any special-forces training).

This academy would not just be a training ground but a think-tank for pioneering network solutions, ultimately creating a cadre of “ninjas” adept at managing and resolving crises that have never been faced before. The question, then, is how might one construct a Network Operations Ninja Academy?

Here are a few thoughts on how I might build a NONA, but I’d love to hear thoughts from you too, dear audience, in the comments below!

1. Recruitment Strategy

  • Profile Identification: The attributes being sought by a NONA are arguably as much about personality traits as they are about network skills. Therefore, there would be a focus on recruiting individuals who not only have a diverse background in network engineering but also demonstrate strong problem-solving skills, adaptability, and stress resilience. As with special forces, it might not be possible to predict in advance who has the required personality traits, only to identify them through stress-testing (see item #3 below).
  • Diverse Backgrounds: However, it’s important to note that network engineering / operations experience might not even be a mandatory requirement for entry into the NONA program. Instead, it would encourage applications from varied fields such as psychology, military, aviation, and emergency response to foster diverse thinking and approaches. Network experience can be taught more easily than personality traits can be changed.

2. Curriculum Design

  • Core Technical Skills: Deep dives into advanced network technologies, system architecture, network operations, blast-zone modelling, in-band / out-of-band management networks, virtualisation and data processing / science, as well as the use of OSS/BSS, AI/ML, simulation, test automation and cybersecurity tools.
  • Non-Technical Skills: Perhaps equally important are the non-technical skills – operations models, escalation pathways, ITSM processes, especially change planning, pre/during/post cutover test & monitoring, major incident management (MIM) and Post Incident Review (PIR)
  • Data Gathering Skills: Deep dives into the technologies that allow for data to be collected for real-time and/or long-term trendlining. Also learning how to gather information in degraded situations such as when networks (or portions of networks) are unreachable or inoperable
  • Scenario-Based Learning: Develop complex, unpredictable network failure scenarios that require innovative thinking and rapid problem-solving. Develop combinational and cascading failure scenarios across different network domains, power supply faults, configuration / change situations, physical equipment failure, software patches, runaway automations, high-availability failover / failback, cyber-attacks, breakdown in communications, human failures and much more. Past events, including seemingly innocuous ones, can be layered together with other events to design new scenarios to test.
  • Psychological Resilience Training: Include cognitive behavioural approaches to test for, and then enhance, decision-making under pressure.
  • Leadership and Communication: Courses on leadership styles, team dynamics, negotiation, and effective communication across diverse and multi-disciplinary teams of experts (internal and external). One of the most important elements of any disaster recovery (DR) situation is the communications required to keep stakeholders informed (and engaged if necessary). Stakeholders range from external notifications to impacted customers (noting that normal forms of communication, such as their comms network, might be down), internal teams, executives, vendor / partner experts, media agencies, etc, etc.
  • Ninja Master Academy: The discipline of building the senseis who design the NONA curriculum, training courses, environments and scenarios that test the trainees

3. Innovative Training Models

Hands-on experience is a fundamental component of the NONA. Machines like AIOps and automation engines will remove many of the opportunities for on-the-job learning and pattern-recognition, so these need to be replicated in new and novel ways.

  • Disaster Recovery Simulations and War Rooms: Most telcos perform scheduled DR simulations / tests. The NONA would take this to the next level in terms of cadence and sophistication. Life-like DR simulations on real network environments are required, in addition to scheduled simulations, to mimic high-pressure outage situations where trainees must respond in real-time.
  • Rotational Apprenticeships: Partner with various tech firms to provide on-the-job training in different network environments (internal and external placements, not just classroom training). It also seems important for apprentices to be seconded through the various phases of the ITIL life-cycle – from strategy, to design, to transition, to operations and continuous improvement.
  • Feedback-Driven Learning: Implement a continuous feedback loop where each simulation is followed by a detailed analysis of decision-making processes and outcomes

4. Technology Integration

  • Test Automation: A key requirement for simulating unknown / unforeseeable scenarios is to create combinations, cascades and/or volumes of events that are unlikely to ever be seen in the wild. Test automation harnesses would appear to be the mechanism to best generate never-before-seen network failures, enhancing trainee exposure to unpredictable scenarios
  • Destructive Environments: A decision is to be made about whether chaos monkey scenarios of intentional destruction can be released on production environments to enforce greater resilience or, more likely, on non-prod environments that mimic production. The latter is to be built knowing that destructive testing will be performed on it. This also means that roll-back or roll-forward mechanisms are required to quickly bring them back to readiness for the next batch of testing / training
  • Virtual Reality (VR) and Augmented Reality (AR): As AR/VR become increasingly likely to enter our future ways of working, could we even begin to employ VR/AR to create immersive troubleshooting environments that mimic real-world chaos in network operations?
  • New Data Science, New Visualisation, New OSS / Testing Tools: Just like special forces teams, the NONA may need to make special-purpose tools that the rest of the telco’s operations team don’t have access to. This means designing UI / UX for complexity minimisation, hypothesis / variant testing at scale, resolution speed and collaboration, rather than for the volume handling and escalation that typical tools are built around.
  • Out of Band (OOB) Communications: As mentioned earlier, in the worst outage situations, the “normal” forms of communication (i.e. the carrier’s network) are impacted. This could be in terms of accessing management traffic and the telemetry data that it carries to inform essential decision-making. It could also be in the form of communicating the problem and estimated time to repair to the outside world. DR plans should assume that typical communications paths are unavailable and that OOB mechanisms are required.
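To make the Test Automation idea above a little more concrete, here is a minimal Python sketch of how a harness might enumerate combinational, cascading failure scenarios and skip any that have already been seen. Everything in it is an assumption for illustration only: the fault catalogue, the `Scenario` class and the `generate_scenarios` / `pick_drill` function names are hypothetical, and a real harness would also need to inject the faults into a test environment.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical fault catalogue, grouped by domain (illustrative values only).
FAULT_DOMAINS = {
    "power": ["mains_loss", "ups_failure"],
    "config": ["bad_change", "runaway_automation"],
    "hardware": ["line_card_failure", "fibre_cut"],
    "security": ["ddos", "credential_compromise"],
}


@dataclass(frozen=True)
class Scenario:
    # Ordered (domain, fault) pairs; the order models the cascade sequence.
    faults: tuple


def generate_scenarios(domains, depth=2, seen=frozenset()):
    """Yield each ordered cascade of `depth` faults drawn from distinct
    domains, skipping any scenario already present in `seen`."""
    for domain_order in itertools.permutations(domains, depth):
        fault_choices = (domains[d] for d in domain_order)
        for faults in itertools.product(*fault_choices):
            scenario = Scenario(tuple(zip(domain_order, faults)))
            if scenario not in seen:
                yield scenario


def pick_drill(domains, depth=2, seen=frozenset(), seed=None):
    """Pick one not-yet-seen scenario at random (repeatable via `seed`)."""
    candidates = list(generate_scenarios(domains, depth, seen))
    if not candidates:
        return None
    return random.Random(seed).choice(candidates)
```

Calling `pick_drill(FAULT_DOMAINS, depth=3, seen=already_run)` after each training session, with `already_run` holding the scenarios used so far, would keep serving trainees cascades they have not yet encountered.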

5. Continuous Professional Development

  • Updates on New Technologies and Methods: Regular updates and training sessions on emerging technologies and methodologies in network operations. Regular updates on new processes and org chart / ops-model changes. Updates on, and review of, upcoming changes that are seen as high-risk
  • Alumni Network: Establish a strong alumni network that allows for ongoing knowledge sharing and mentorship among past and current academy members. But this will probably also extend to knowledge sharing between NONAs of different network operators and vendor / supplier organisations as well

6. Support and Wellbeing

With NONA training designed to put trainees in high-stress situations, it’s possible that there could be a mental toll. Therefore it seems important to put the right support and wellbeing programs in place to support these trainees:

  • Mental Health Resources: Provide robust support systems including access to psychological counselling to help manage stress and prevent burnout.
  • Mindfulness and Stress Management: Regular workshops on mindfulness, yoga, and other stress management techniques.

7. Assessment and Certification

  • Real-Time Performance Tracking: Utilise advanced tracking and analytics to assess trainee performance in real-time during simulations. These aren’t necessarily binary assessments. They could also include metrics such as time to repair, time to restore, time to recover, percentage of data recovered, etc.
  • Certification Programs: Should certification be localised, on a per-network basis for the unique context there, or should there be an industry-wide recognition, endorsing the skills in handling unprecedented network issues?
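As a rough illustration of the non-binary metrics mentioned under Real-Time Performance Tracking, the sketch below derives time to restore, time to repair and percentage of data recovered from a single simulation run. The `DrillResult` fields and the `score` function are hypothetical names chosen for this example, and the definitions used (both times measured from fault injection) are an assumption rather than an agreed standard.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    # All timestamps are seconds since the start of the simulation (assumed).
    fault_injected_s: float
    service_restored_s: float   # customers back in service (e.g. via workaround)
    fault_repaired_s: float     # underlying fault actually fixed
    records_before: int         # data records present before the fault
    records_recovered: int      # data records present after recovery


def score(result: DrillResult) -> dict:
    """Turn raw drill timestamps and counts into graded metrics."""
    if result.records_before:
        recovered_pct = 100.0 * result.records_recovered / result.records_before
    else:
        recovered_pct = 100.0  # nothing to lose, so nothing was lost
    return {
        "time_to_restore_s": result.service_restored_s - result.fault_injected_s,
        "time_to_repair_s": result.fault_repaired_s - result.fault_injected_s,
        "data_recovered_pct": recovered_pct,
    }
```

Metrics like these could be tracked per trainee across many simulations, so that assessment reflects a trend rather than a single pass / fail result.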

There’s sure to be plenty of layers that I’ve managed to overlook from this NONA plan. What have I missed? Please leave us a comment below.
