As you already know, there are two categories of downtime – unplanned (eg failures) and planned (eg upgrades / maintenance).
Planned downtime sounds a lot nicer (for operators), but in reality you could call both types “incidents” because both impact (or potentially impact) the customer. We sometimes underestimate that fact.
Today’s question is whether you’re able to identify where the hotspots are in your OSS suite when you combine both types of downtime. Can you tell which outages are service-impacting?
In a roundabout way, I’m asking whether you already have a dashboard that monitors the uptime of all the components (eg applications, probes, middleware, infra, etc) that make up your complete OSS / BSS estate. If you do, does it confirm what you already know anecdotally, or are there sometimes surprises?
Does the data give you the evidence you need to negotiate with the implementers of problematic components (eg patch cadence, the need for reliability fixes, streamlining the patch process, reduction in customisations, etc)? Does it give you reason to make architectural changes (eg webscaling)?
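To make the idea of a combined view a little more concrete, here is a minimal sketch of how planned and unplanned downtime could be rolled up per component to surface hotspots. Everything in it is an assumption for illustration only: the `Outage` record, its field names, the `hotspots` function, the component names and the sample figures are all hypothetical, and it assumes outages on the same component don't overlap in time. Your own data model will almost certainly look different.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical outage record; field names are illustrative only."""
    component: str           # e.g. an application, probe or middleware layer
    kind: str                # must be "planned" or "unplanned"
    minutes: float           # duration of the outage
    service_impacting: bool  # did (or could) the customer notice?


def hotspots(outages, period_minutes):
    """Sum downtime per component and rank components by total downtime.

    Assumes outages on one component never overlap, so durations can
    simply be added together.
    """
    totals = defaultdict(
        lambda: {"planned": 0.0, "unplanned": 0.0, "impacting": 0.0}
    )
    for o in outages:
        if o.kind not in ("planned", "unplanned"):
            raise ValueError(f"unknown outage kind: {o.kind!r}")
        totals[o.component][o.kind] += o.minutes
        if o.service_impacting:
            totals[o.component]["impacting"] += o.minutes

    rows = []
    for component, t in totals.items():
        down = t["planned"] + t["unplanned"]
        rows.append({
            "component": component,
            "planned": t["planned"],
            "unplanned": t["unplanned"],
            "impacting": t["impacting"],
            "uptime_pct": 100.0 * (1 - down / period_minutes),
        })
    # Worst offenders first, counting both types of downtime.
    return sorted(rows, key=lambda r: r["planned"] + r["unplanned"], reverse=True)


# Made-up sample data for a 30-day reporting period.
PERIOD_MINUTES = 30 * 24 * 60
sample = [
    Outage("mediation", "planned", 240, True),
    Outage("mediation", "unplanned", 90, True),
    Outage("inventory", "unplanned", 45, False),
    Outage("probe-collector", "planned", 120, False),
]
ranked = hotspots(sample, PERIOD_MINUTES)

for row in ranked:
    print(
        f"{row['component']:<16} planned={row['planned']:.0f}m "
        f"unplanned={row['unplanned']:.0f}m impacting={row['impacting']:.0f}m "
        f"uptime={row['uptime_pct']:.3f}%"
    )
```

Ranking on the combined total (rather than on unplanned downtime alone) is the design choice that matters here: a component with few failures but a heavy patch schedule can still turn out to be the biggest hotspot.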