Where are the reliability hotspots in your OSS?

As you already know, there are two categories of downtime – unplanned (eg failures) and planned (eg upgrades / maintenance).

Planned downtime sounds a lot nicer (for operators) but the reality is that you could call both types “incidents” – they both impact (or potentially impact) the customer. We sometimes underestimate that fact.

Today’s question is whether you’re able to identify where the hotspots are in your OSS suite when you combine both types of downtime. Can you tell which outages are service-impacting?

In a round-about way, I’m asking whether you already have a dashboard that monitors uptime of all the components (eg applications, probes, middleware, infra, etc) that make up your complete OSS / BSS estate? If you do, does it tell you what you anecdotally know already, or are there sometimes surprises?

Does the data give you the evidence you need to negotiate with the implementers of problematic components (eg patch cadence, the need for reliability fixes, streamlining the patch process, reduction in customisations, etc)? Does it give you reason to make architectural changes (eg webscaling)?

Read the Passionate About OSS Blog for more or Subscribe to the Passionate About OSS Blog by Email

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.