Facebook’s Outage

Today, many knowledge workers found a mainstay of their productive lives unavailable – Facebook. Yes, to the horror of over a billion users, the world’s most popular social networking website was down for over 30 minutes despite having one of the most sophisticated IT infrastructures in the industry.

(Image via TechCrunch)

A company so advanced it redesigned its own software to increase energy efficiency should be nearly immune to downtime, or at the very least, should be able to rapidly recover – right? The reality is that this type of occurrence is not unique to Facebook. In fact, a large number of tech firms struggle with the very difficult task of ensuring the availability, resiliency and performance of their IT infrastructure. In 2014 alone, we saw a number of notable IT failures by tech giants:

  • On January 10 Dropbox… dropped. And it stayed that way for nearly the entire weekend.
  • On January 24 Gmail became unavailable for nearly 30 minutes. Certainly not the colossal outage of Dropbox, but long enough to cause heart palpitations for Gmail users.
  • On May 16 designers were left with free time on their hands when Adobe’s Creative Cloud service went down for nearly 28 hours.
  • On June 19 earnings and productivity for several Fortune 2000 organizations experienced a major boost when Facebook went down, an experience similar to the one today.
  • On June 29 many corporate workers experienced something very rare – distraction-free work, thanks to Microsoft’s Exchange Online Service going offline for about nine hours.

And that’s only halfway through the year! So, what can we learn from these notable failures? That all is lost and ensuring good IT service is impossible? No, not really. I think the lesson is that IT infrastructures are incredibly complex. Understanding how all of the different elements in your IT stack relate to each other is a must when providing good service. It isn’t a must only because users get upset when IT services are down; it’s a must because downtime can have a serious financial impact on the business. Consider a rather crude and not very accurate cost estimate for Facebook.

As of March 2014, Facebook made roughly $15,000 in revenue every minute. If one argued (again, without much accuracy) that all of that revenue depended on the availability of the service, this little 40-minute downtime cost Facebook $600,000. To better illustrate the complexity of today’s IT infrastructure, let’s take a simple web application.
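As a quick aside on that cost figure: the estimate is nothing more than revenue per minute multiplied by minutes of downtime. Here is a minimal sketch of the arithmetic. The function name and the share_at_risk parameter are illustrative assumptions of mine, not anything Facebook publishes; the parameter simply lets you relax the crude assumption that every dollar of revenue depends on availability.

```python
def downtime_cost(revenue_per_minute, minutes_down, share_at_risk=1.0):
    """Crude downtime cost: revenue per minute x minutes down x share of revenue at risk.

    share_at_risk is a hypothetical knob (1.0 means all revenue depends on uptime).
    """
    return revenue_per_minute * minutes_down * share_at_risk


# Figures quoted in the post: roughly $15,000 per minute, 40 minutes down.
estimate = downtime_cost(15_000, 40)
print(f"Estimated cost: ${estimate:,.0f}")
```

Lowering share_at_risk to, say, 0.5 halves the figure, which is a reminder of how sensitive the estimate is to that one assumption.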

If a user calls the help desk complaining that they can’t get to their web application, is it the application itself that is experiencing a hiccup? Or is it the OS? How about the virtual machine it is running on? Could it be the bare metal infrastructure the VM is running on top of? Maybe it’s the network connecting the application to the storage it is using? What about the network connection between the end user and the application? See, it’s complex.
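One way to make that chain of questions concrete is to write the layers down as a dependency map and walk it. The sketch below is purely illustrative: the component names, the DEPENDS_ON map and both helper functions are assumptions made for this example (including the simplification of treating the user's network path as a dependency of the application), not a description of any real product.

```python
# Hypothetical dependency map for the simple web application described above.
DEPENDS_ON = {
    "web application": ["operating system", "storage network", "user network"],
    "operating system": ["virtual machine"],
    "virtual machine": ["bare metal host"],
    "storage network": ["storage"],
}


def suspects(component, depends_on):
    """Return the component plus everything it depends on, in depth-first order."""
    seen = []
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        # Reverse so that dependencies are visited in the order they are listed.
        stack.extend(reversed(depends_on.get(current, [])))
    return seen


def likely_root_causes(component, depends_on, unhealthy):
    """Unhealthy suspects that have no unhealthy dependency beneath them."""
    return [
        c for c in suspects(component, depends_on)
        if c in unhealthy
        and not any(d in unhealthy for d in depends_on.get(c, []))
    ]


# Suppose monitoring flags these three components as unhealthy.
unhealthy = {"web application", "operating system", "virtual machine"}
print(likely_root_causes("web application", DEPENDS_ON, unhealthy))
```

With that made-up health data, the walk points at the virtual machine: it is the deepest unhealthy layer, so the OS and application problems above it are more likely symptoms than causes.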

Thankfully there are tools out there that can discover and map the different elements making up an infrastructure and the dependencies between those elements. That makes it much simpler to find the underlying cause of an issue and resolve that issue before users are impacted. Considering past experiences, who will be the next IT heavyweight to be taken down by a very public outage?

What are your thoughts and opinions on Facebook’s most recent outage? Leave a comment below!
