In our previous article, we described five new OSS/BSS automation design rules applied by Elon Musk. Today, we’ll continue on into the second part of the Starbase video tour and lock onto another brilliant insight from Elon.
From 5:10 to 11:00 in the video below, Elon and Tim (of Everyday Astronaut) discuss why failure, in certain scenarios, is a viable option even in the world of rocket development.
An edited excerpt from the video is provided below.
“[Elon] We have just a fundamentally different optimization for Starship versus say, like the polar extreme would be Dragon. Dragon, there can be no failures ever. Everything’s gotta be tested six ways to Sunday. There has to be tons of margin. There can never be a failure ever for any reason whatsoever.
Then Falcon is a little less conservative. It is possible for us to have, say, a failure of the booster on landing. That’s not the end of the world.
And then for Starship, it’s like the polar opposite of Dragon: we’re iterating rapidly in order to create the first ever fully reusable rocket, orbital rocket.
Anyway, it’s hard to iterate, though, when people are on every mission. You can’t just be blowing stuff up ’cause you’re gonna kill people. Starship does not have anyone on board so we can blow things up.
[Tim] Are you just hoping that by the time you put people on it, you’ve flown it say 100, 200 times, and you’re familiar with all the failure modes, and you’ve mitigated it to a high degree of confidence?”
Interesting…. Very interesting…
How does that relate to OSS? Well, first I’d like to share with you another story, this time about pottery, that I also found fascinating.
“The ceramics teacher announced on opening day that he was dividing the class into two groups. All those on the left side of the studio, he said, would be graded solely on the quantity of work they produced, all those on the right solely on its quality. His procedure was simple: on the final day of class he would bring in his bathroom scales and weigh the work of the “quantity” group: fifty pounds of pots rated an “A,” forty pounds a “B,” and so on. Those being graded on “quality,” however, needed to produce only one pot—albeit a perfect one—to get an “A.” Well, come grading time and a curious fact emerged: the works of the highest quality were all produced by the group being graded for quantity. It seems that while the “quantity” group was busily churning out piles of work—and learning from their mistakes—the “quality” group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay”
That story comes from David Bayles and Ted Orland’s book, “Art & Fear: Observations on the Perils (and Rewards) of Artmaking,” which I discussed in a previous post years ago.
These two stories are all about being given the licence to iterate, within an environment where there is also a licence to fail.
The consequences of failure for our OSS/BSS usually fall somewhere between these two examples. Not quite as catastrophic as crashing a rocket, but more impactful than creating some bad pottery.
But even within each of these, there are different scales of consequence. Elon astutely identifies that it’s less catastrophic to crash an unmanned rocket than it is to blow up a manned one, killing the occupants. That’s why he has different platforms. By reducing compliance checks, redundancy and safety margins, he can use his unmanned Starship platform to iterate rapidly, flying a lot more often than the manned Dragon platform.
So, let me ask you, what are our equivalents of Starship and Dragon?
Answer: Non-Prod and Prod environments!
We can set up non-production environments where it’s safe to crash the OSS/BSS rocket without killing anyone. We can relax the compliance checks, run lots of tests / iterations and learn rapidly, identifying the risks and unknowns.
With OSS/BSS being software, we’re probably already doing this. Nothing particularly new in the paragraph above. But I’m less interested in ensuring the reliability of our OSS/BSS (although that’s important of course). I’m actually more interested in ensuring the reliability of the networks that our OSS/BSS manage.
What if we instead changed the lens and used the OSS/BSS to intentionally test / crash the (non-production or lab) network (and perhaps the EMS, NMS and OSS/BSS too)? I previously discussed the concept of CT/IR – Continual Test / Incremental Resilience here, which is analogous to CI/CD (Continuous Integration / Continuous Delivery) in software. CT/IR is a method to automatically, systematically and programmatically test the resilience of the network, then use the learnings to harden the network so that resilience continually improves.
Like the SpaceX scenario, we can use automated destructive testing to pump out testing volumes and variants that could never occur in risk-averse environments like Dragon / Production. Whereas planned, thoroughly tested changes to production may only be allowed during defined and pre-approved change windows, intentionally destructive tests can be run day and night on non-prod environments.
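To make the CT/IR idea a little more concrete, here’s a minimal sketch of one test / harden cycle in Python. Everything in it is an assumption for illustration only: the `LabNetwork` class, the `run_ct_cycle` function and the four-node topology are hypothetical, and a toy graph stands in for a real lab network. It simply fails every link in turn, records which failures break connectivity, and refuses to run at all if pointed at prod.

```python
from collections import deque
from itertools import combinations


class LabNetwork:
    """Hypothetical toy model of a network: nodes joined by undirected links."""

    def __init__(self, env, links):
        self.env = env  # e.g. "lab" or "prod" (assumed labels)
        self.links = {frozenset(link) for link in links}

    def is_connected(self, src, dst, failed=()):
        """Breadth-first search from src to dst, ignoring the failed links."""
        live = self.links - {frozenset(link) for link in failed}
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for link in live:
                if node in link:
                    (other,) = link - {node}  # assumes no self-loops
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        return False


def run_ct_cycle(network, src, dst, max_failures=1):
    """Fail every combination of up to max_failures links and report
    the combinations that cut src off from dst."""
    if network.env == "prod":
        # The Dragon rule: no intentional explosions with people on board.
        raise PermissionError("destructive tests are limited to non-prod")
    weaknesses = []
    for k in range(1, max_failures + 1):
        for combo in combinations(sorted(network.links, key=sorted), k):
            if not network.is_connected(src, dst, failed=combo):
                weaknesses.append(tuple(tuple(sorted(l)) for l in combo))
    return weaknesses


# Continual Test: a simple chain A-B-C-D, where every link is critical.
lab = LabNetwork("lab", [("A", "B"), ("B", "C"), ("C", "D")])
found = run_ct_cycle(lab, "A", "D")

# Incremental Resilience: harden the lab design with a redundant path,
# then run the same destructive tests again.
lab.links.add(frozenset(("A", "D")))
after = run_ct_cycle(lab, "A", "D")

print(len(found), "single points of failure before hardening")
print(len(after), "single points of failure after hardening")
```

In a real lab the “fail a link” step would be an actual fault injection driven by the OSS/BSS, and the connectivity check would be a service assurance test, but the loop of test, learn and harden would look much the same.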
We could even pre-seed AIOps data sets with all the boundary failure cases, so they’re ready for introduction into production environments before those failures have ever been seen in prod.
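As a rough sketch of what that pre-seeding might look like, the Python below turns failure cases captured in the lab into labelled records, one JSON object per line. The `FailureCase` fields, the alarm names and the `build_seed_dataset` function are all hypothetical, as is the record format; a real AIOps tool would dictate its own schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailureCase:
    """Hypothetical record of one destructive test run in the lab."""
    scenario: str
    injected_fault: str
    observed_alarms: list
    source_env: str = "lab"
    seen_in_prod: bool = False


def build_seed_dataset(cases):
    """Serialise lab-origin cases as JSON lines, skipping anything from prod."""
    return "\n".join(
        json.dumps(asdict(case), sort_keys=True)
        for case in cases
        if case.source_env != "prod"
    )


# Illustrative data only; the fault and alarm names are made up.
cases = [
    FailureCase("core link cut", "link_down A-B", ["LOS", "BGP_PEER_DOWN"]),
    FailureCase("dual power loss", "psu_fail x2", ["POWER_FAIL", "NODE_UNREACHABLE"]),
    FailureCase("real outage", "fibre dig-up", ["LOS"], source_env="prod", seen_in_prod=True),
]

seed = build_seed_dataset(cases)
print(seed)
```

Tagging each record with where it came from, and whether it has ever been seen in prod, keeps the lab-generated boundary cases distinguishable from real production history once they’re mixed into the same data set.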