Facebook’s Outage

Sophisticated IT Infrastructure Is a Double-Edged Sword
Michelle DeFiore

Today, many knowledge workers found a mainstay of their productive lives unavailable – Facebook. Yes, to the horror of over a billion users, the world’s most popular social networking website was down for over 30 minutes despite having one of the most sophisticated IT infrastructures in the industry.

A company so advanced that it redesigned its own software to increase energy efficiency should be nearly immune to downtime, or at the very least should be able to recover rapidly – right? The reality is that this type of occurrence is not unique to Facebook. In fact, a large number of tech firms struggle with the very difficult task of ensuring the availability, resiliency and performance of IT infrastructure. In 2014 alone, we saw a number of notable IT failures by tech giants:

  • On January 10, Dropbox… dropped. And it stayed that way for nearly the entire weekend.
  • On January 24, Gmail became unavailable for nearly 30 minutes. Certainly not as colossal an outage as Dropbox's, but long enough to cause heart palpitations for Gmail users.
  • On May 16, designers were left with free time on their hands when Adobe's Creative Cloud service went down for nearly 28 hours.
  • On June 19, earnings and productivity at several Fortune 2000 organizations got a major boost when Facebook went down, an experience similar to today's.
  • On June 29, many corporate workers experienced something very rare: distraction-free work, thanks to Microsoft's Exchange Online service going offline for about nine hours.

And that's only halfway through the year! So, what can we learn from these notable failures? That all is lost and ensuring good IT service is impossible? No, not really. I think it's more that IT infrastructures are incredibly complex. Understanding how all of the different elements in your IT stack relate is a must when providing good service. It isn't a must just because users get upset when IT services are down; it's a must because downtime can have a serious financial impact on the business. Consider a rather crude and admittedly not very accurate cost estimate for Facebook.

As of March 2014, Facebook made roughly $15,000 in revenue every minute. If one argued (again, without a lot of accuracy) that all of that revenue depends on the availability of the service, this little 40-minute bout of downtime cost Facebook around $600,000.
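To make that back-of-the-envelope math explicit, here is a minimal sketch in Python. The per-minute revenue and outage length are the rough figures above, and the assumption that every minute of downtime forfeits a full minute of revenue is, as noted, crude.

    # Crude downtime cost estimate using the rough figures quoted above.
    REVENUE_PER_MINUTE = 15_000   # approximate Facebook revenue per minute, March 2014
    OUTAGE_MINUTES = 40           # approximate length of the outage

    def downtime_cost(revenue_per_minute, outage_minutes):
        """Naively assumes every minute of downtime forfeits a full minute of revenue."""
        return revenue_per_minute * outage_minutes

    print(f"Estimated cost: ${downtime_cost(REVENUE_PER_MINUTE, OUTAGE_MINUTES):,.0f}")
    # Estimated cost: $600,000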

To better illustrate the complexity of today's IT infrastructure, let's take a simple web application. If a user calls the help desk complaining that they can't get to their web application, is it the application itself that is experiencing a hiccup? Or is it the OS? How about the virtual machine it is running on? Could it be the bare-metal infrastructure the VM is running on top of? Maybe it's the network connecting the application to the storage it uses? What about the network connection between the end user and the application? See, it's complex.
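Troubleshooting that question by hand means walking a dependency chain, which is roughly what mapping tools automate. Below is a minimal, hypothetical sketch in Python; the component names, the dependency map and the health flags are invented for illustration, not taken from any real monitoring product.

    # Hypothetical dependency map for the simple web application described above.
    # Each component lists the components it depends on.
    DEPENDS_ON = {
        "web_app": ["os", "network_user_to_app"],
        "os": ["virtual_machine"],
        "virtual_machine": ["bare_metal", "network_app_to_storage"],
        "bare_metal": [],
        "network_user_to_app": [],
        "network_app_to_storage": ["storage"],
        "storage": [],
    }
    # Made-up health data: pretend the storage array is the real culprit.
    HEALTHY = {name: True for name in DEPENDS_ON}
    HEALTHY["storage"] = False

    def root_causes(component):
        """Walk the dependency chain and return the deepest unhealthy components."""
        unhealthy_deps = []
        for dep in DEPENDS_ON[component]:
            unhealthy_deps += root_causes(dep)
        if unhealthy_deps:
            return unhealthy_deps  # blame the dependencies, not this layer
        return [] if HEALTHY[component] else [component]

    print(root_causes("web_app"))  # ['storage']

With a map like this, the help-desk question above collapses to a single walk down the chain instead of guesswork at each layer.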

Thankfully, there are tools out there that can discover and map the different elements making up an infrastructure, along with the dependencies between those elements. That makes it much simpler to find the underlying cause of an issue and resolve it before users are impacted. Considering past experience, who will be the next IT heavyweight to be taken down by a very public outage?

What are your thoughts and opinions on Facebook’s most recent outage? Leave a comment below!

Comments

  • Stéphane Lacasse

    What you need:
    – An integrated documentation system
    I've seen places where documentation is spread across various systems: folders with pdf/docx/html/text files, email boxes, wikis, etc. Documentation should be centrally managed, standardized, structured, searchable and comprehensible, with details integrated into a big picture.
    – A good monitoring system
    One that supports trigger dependencies, which, when well configured, should pinpoint what and where the problem is, exactly or close to it.
    – A complete and up-to-date disaster recovery plan
    Maintaining a DRP is a continuous process of asking yourself "what if?" for every imaginable scenario. It provides contingency plans at a more abstract level, based on severity, impact (on reputation, on cost), outage time, and so on, along with what to do in each case. It also identifies weaknesses where solutions can be applied preemptively, like adding redundancy where there are single points of failure.
    – Log everything centrally
    With a central log system, you can easily correlate in one view what happened on which system.
    – Test, test, test.
    Test your setup in simulators, your upgrades in VMs, etc. Many solutions exist. I, for one, used VMs and GNS3 to simulate our infrastructure as close to reality as possible and tried to break it. If I did, I implemented a fix, then tried to break it again.
    Perform plan drills so that your people can practice what to do in various situations. You will surely identify problems (team coordination, replacement parts, missing information, etc.) and apply solutions.
    – Inform your users/clients
    When there is an outage, inform your users and/or your clients of what the problem is, what you are doing to fix it, and an ETA. They will appreciate knowing that they are your top concern and that you're working hard for them.
    – Mea Culpa
    After each outage, gather as much information as possible, analyze the problems, find fixes to prevent recurrences, document everything, adjust your monitoring and adapt your DRP.

    I'm sure Facebook's data centers apply everything stated here and more. In my mind, it is simply not possible to think of everything. Sometimes you take chances and leave a single point of failure hanging because the risk/cost ratio simply isn't worth addressing it.

    Outages in IT will always happen, just like plane crashes will. The goal is to have as few as possible.