MTBF for virtualised networks

Under certain engineering assumptions … the failure rate for a complex system is simply the sum of the individual failure rates of its components, as long as the units are consistent, e.g. failures per million hours. This permits testing of individual components or subsystems, whose failure rates are then added to obtain the total system failure rate.
Wikipedia.

When writing a recent blog post about Data Centre Infrastructure Management (DCIM), it struck me that the additional hierarchy described in its dot points is effectively adding extra complexity and extra layers of things that could go wrong.

Without doing any calculations to back this up, it seems logical that the MTBF (Mean Time Between Failures) of any multi-layer solution (virtualised networks in this case) would be lower than that of a solution with far fewer layers (physical networks), because every extra layer adds its own failure rate to the total.
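To put rough numbers on that intuition, here is a minimal Python sketch of the sum rule from the quote above. The layer names and failure rates are invented purely for illustration (they are not measured values), and the calculation assumes a series system with constant failure rates, where any single layer failing takes the whole service down.

```python
# Illustrative only: layer names and failure rates are hypothetical,
# expressed in failures per million hours.
physical_stack = {
    "router hardware": 10.0,
    "network OS": 5.0,
}
virtualised_stack = {
    "server hardware": 10.0,
    "hypervisor": 4.0,
    "virtual switch": 3.0,
    "VNF software": 5.0,
    "orchestrator": 3.0,
}


def series_failure_rate(rates_per_million_hours):
    """Sum the component failure rates of a series system."""
    return sum(rates_per_million_hours)


def mtbf_hours(rate_per_million_hours):
    """MTBF is the reciprocal of a constant failure rate."""
    return 1_000_000 / rate_per_million_hours


for name, stack in [("physical", physical_stack), ("virtualised", virtualised_stack)]:
    rate = series_failure_rate(stack.values())
    print(f"{name}: {rate} failures per million hours, MTBF {mtbf_hours(rate):.0f} hours")
```

With these made-up inputs the stack with more layers has the higher total failure rate and therefore the lower MTBF. The sketch deliberately ignores redundancy and repair.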

But tied to this, a virtual solution that has more layers also means more complexity for the OSS to manage. In my opinion, one of the main reasons a high percentage of OSS projects fail to deliver on time, budget and/or functionality is complexity, so our objective should be to significantly reduce complexity rather than increase it. Similarly, we should seek to improve solution reliability by reducing complexity.

Whilst SDN and NFV solutions may seem to offer efficiency improvements, I still can’t help but think that the increased complexity in their overarching OSS and management suites (not to mention related process complexities) will erode some of the efficiencies gained in virtualised networking.


2 thoughts on “MTBF for virtualised networks”

  1. One of the main benefits of virtualization, at least NFV style virtualization, is that the process is less coupled to the underlying hardware. So, while there are more parts to potentially fail, all layers have greater resilience because:
    – In the event of hardware failure VMs can spin up elsewhere (in the worst case) or you have a physically distributed group of processes already balanced across physical devices (in the best case).
    – In the event of software/process failure, you can have a hot standby process ready to replace it within (milli-)seconds.

    This does require virtualized resources to be designed differently from their physical equivalents, and perhaps this will impact telco protocols too: If you had, say, a virtual RNC function, do the various protocols of 3G/SAE leverage the potential to spin up a new, replacement virtual RNC in the event of an outage, and do so gracefully? I don’t know the answer to that.

  2. Hi James,

    Very valid point! Overall system resilience of virtualised solutions should be taken into consideration when reading my post.

    Perhaps the statement that “the failure rate for a complex system is simply the sum of the individual failure rates of its components” doesn’t hold true for resilient systems? I should look into that one!
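As a starting point for that question, here is a hedged Python sketch contrasting a series arrangement with a redundant (parallel) one. It assumes constant failure rates, independent failures, no repair during the period, and perfect failover to the standby, none of which comes from the post itself. The failure rate and mission time are hypothetical.

```python
import math


def reliability(rate_per_hour, hours):
    """Probability a unit survives `hours` with a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def series(reliabilities):
    """Series system: every component must survive."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Redundant system: fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical figures: 25 failures per million hours, one year of operation.
rate = 25 / 1_000_000
one_year = 8760

single = reliability(rate, one_year)
redundant_pair = parallel([single, single])

print(f"single instance survives the year: {single:.4f}")
print(f"redundant pair survives the year:  {redundant_pair:.4f}")
```

Under these assumptions the sum rule describes the series case only. Once a standby can take over, the combined chance of failure is the product of the individual failure probabilities, so the pair is more reliable than either instance alone.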
