Automating the Work of Hundreds

“Today, the FBAR service [Facebook Auto-Remediation] is developed and maintained by two full time engineers, but according to the most recent metrics, it’s doing the work of approximately 200 full time system administrators. FBAR now manages more than 50% of the Facebook infrastructure and we’ve found that services have dramatic increases in reliability when they go under FBAR control. Recently, we’ve opened up development of remediation plugins to other teams working on Facebook’s back end services so they can implement their service-specific business logic. As these teams write their own remediation plugins, we’re expanding FBAR coverage to more and more of the infrastructure. This is making the site more and more reliable for end users while reducing the workload of the supporting engineers.”
Alethea Power

I’ve only just stumbled across this really interesting article (see link above) on Facebook’s FBAR service, thanks to a new article on SDxCentral. What’s even more interesting is that the original article was written back in 2011, when FBAR was launched.

Today, FBAR sifts through 3.37 billion notifications from network devices each month, filtering out noise down to roughly 750,000 alarms that need action, Najam Ahmad [Facebook’s director of network engineering] said. Of those, FBAR resolves 99.6 percent of the alarms without human intervention, according to Ahmad.
To achieve that kind of automation, Facebook has put heavy emphasis on modular, standardized network hardware controlled by software.
That’s according to this article.
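
To put those quoted numbers in perspective, here is a quick back-of-the-envelope sketch. It only uses the three figures from the quote above; the variable and function names are my own, and the 30-day month is an assumption for the per-day estimate.

```python
# Monthly figures as quoted in the article above.
notifications = 3_370_000_000   # notifications from network devices
actionable_alarms = 750_000     # alarms left after filtering out noise
auto_resolved_rate = 0.996      # share of alarms resolved without a human


def funnel(total, alarms, auto_rate):
    """Return (share of notifications that become alarms, alarms left for humans)."""
    alarm_share = alarms / total
    manual_alarms = round(alarms * (1 - auto_rate))
    return alarm_share, manual_alarms


alarm_share, manual_alarms = funnel(notifications, actionable_alarms, auto_resolved_rate)

print(f"{alarm_share:.4%} of notifications become actionable alarms")
print(f"{manual_alarms:,} alarms per month still need a human")
# Assumption: a 30-day month, just to get a rough daily figure.
print(f"roughly {manual_alarms / 30:.0f} per day")
```

In other words, only about two in every ten thousand notifications become an actionable alarm, and of those roughly 3,000 a month (around 100 a day) end up in front of a person.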

Gotta love those hyper-scale networks (e.g. Facebook’s, Google’s) for providing the impetus to solve the big problems that OSS is also wrestling with. It’s interesting too that, because of their scale, they’re taking a different approach: building with their bright in-house resources rather than using existing products / vendors.

I especially love the emphasis on modular, standardised hardware at Facebook. The fewer the variants in network devices and configurations, the easier it is to set up these types of automations (albeit still not easy!).

I equate this approach to the Southwest Airlines analogy (see yesterday’s blog). I also wonder whether the next-gen OSS product suites developed by traditional vendors should take a closer look at the way that the hyper-scalers are tackling the issue of resolving faults. Virtualised networks, wireless sensor networks and big data analytics are going to take the networks/services we manage to vastly larger scales in coming years anyway.

What do you think? Do we need to reconsider our approaches and learn from the hyper-scalers?
