I like to compare OSS projects with chaos theory. A single butterfly flapping it’s wings (eg a conversation with the client) can have unintended consequences that cause a tornado (eg the client’s users refusing to use a new OSS).
The day-to-day operation of a network and its management tools can be similarly sensitive to seemingly minor inputs. We can never predict or test for every combination of knock-on effects. This means that forecasting the future is impossible and failure is inevitable.
If we take these two statements to be true, it perhaps changes the way we engineer our OSS.
How many production OSS (and/or related EMS) do you know of whose operators have to tiptoe around the edges for fear of causing a meltdown? Conversely, how many do you know whose operators would quite happily trigger failures with confidence, knowing that their solution is robust and will recover without perceptibly impacting customers?
How many of you could confidently trigger scheduled or unscheduled outages of various types on your production OSS to introduce the machine learning seeding technique discussed yesterday?
Would you be prepared to unleash the chaos monkeys on your OSS / network like Netflix is prepared to do on its production systems?
Most OSS are designed for known errors and mechanisms are put in place to prevent them. Instead I wonder whether we should be design systems on the assumption that failure is inevitable, so recovery should be both rapid and automated.
It’s a subtle shift in thinking. Reduce the test scenarios that might lead to OSS failure, and increase the number of intentional OSS failures to test for recovery.
PS. Oh, and you’d rightly argue that a telco is very different from Netflix. There’s a lot more complexity in the networks, especially the legacy stacks. Many a telco would NEVER let anyone intentionally cause even the slightest degradation / failure in the network. This is where digital twin technology potentially comes into play.