One of the many features of OSS is that they’re great at monitoring network performance and alerting operators to any threshold breaches. Initially, these thresholds were static. For example, if the CPU utilisation on a router goes above x%, an alert gets raised.
But this technique is flawed because there might be recurring peaks (such as 9am on Monday morning, when everyone is syncing their field data / jobs) that are normal and predictable yet still raise alerts to operators.
When I recently reviewed Opmantek, they showed me how they had rolled the work of Igor Trubin and his SEDS concept into their solution.
Please check out the link above to learn how this simple yet elegant solution can reduce false anomaly notifications and the corresponding operator interruptions. I especially like how Figure 7 shows an example of a deteriorating system long before it reaches a failure state.
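To give a rough feel for the difference between a static threshold and a baseline that knows about recurring peaks, here is a minimal Python sketch. To be clear, this is my own simplified illustration and not Igor Trubin's SEDS algorithm or Opmantek's implementation. The names (`STATIC_THRESHOLD`, `build_baseline`, `baseline_alert`), the 80% threshold, the 3-sigma band and the sample data are all assumptions made up for the example. The idea is simply to learn a "normal" mean and standard deviation for each hour of the week, then alert only when a reading strays too far from what is normal for that hour.

```python
from collections import defaultdict
from statistics import mean, stdev

STATIC_THRESHOLD = 80.0  # hypothetical static CPU % threshold


def static_alert(value):
    """Classic static check: alert whenever the value exceeds a fixed limit."""
    return value > STATIC_THRESHOLD


def build_baseline(history):
    """Learn (mean, stdev) per hour of the week.

    history: iterable of (hour_of_week, cpu_percent) samples,
    where hour_of_week runs 0..167 and 0 is Monday 00:00 (an assumption).
    """
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # stdev needs at least two samples, so skip sparse buckets
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def baseline_alert(baseline, hour, value, k=3.0, min_sigma=1.0):
    """Alert only if value is more than k standard deviations from
    the learned mean for that hour of the week."""
    if hour not in baseline:
        return static_alert(value)  # no history yet: fall back to static check
    mu, sigma = baseline[hour]
    sigma = max(sigma, min_sigma)  # floor avoids alerting on tiny wobbles
    return abs(value - mu) > k * sigma


# Made-up history: four weeks of Monday 9am (hour 9) and Monday 3am (hour 3)
history = [(9, 88.0), (9, 90.0), (9, 92.0), (9, 90.0),
           (3, 10.0), (3, 12.0), (3, 11.0), (3, 9.0)]
baseline = build_baseline(history)

# The usual Monday 9am peak: static threshold fires, baseline stays quiet
print(static_alert(91.0), baseline_alert(baseline, 9, 91.0))  # True False

# An odd 60% reading at 3am: static threshold misses it, baseline flags it
print(static_alert(60.0), baseline_alert(baseline, 3, 60.0))  # False True
```

In this toy example the predictable Monday morning peak no longer interrupts the operator, while an unusual reading well below the static threshold does get flagged. A real implementation would need far more history, and would have to deal with holidays, gradual trends and so on.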
What do you think? Do you have any other great predictive / anomaly / trending algorithms to recommend?