
Facebook explains the backbone shutdown behind its global outage on Monday


And why it took hours to fix

Illustration by Alex Castro / The Verge

The massive outage that took down Facebook, its associated services (Instagram, WhatsApp, Oculus, Messenger), its platform for businesses, and the company’s own internal network all started with routine maintenance.

According to infrastructure vice president Santosh Janardhan, a command issued during maintenance inadvertently caused a shutdown of the backbone that connects all of Facebook’s data centers, everywhere in the world.

That by itself is bad enough, but as we’ve already explained, the reason you couldn’t use Facebook is that the DNS and BGP routing information pointing to its servers suddenly disappeared. According to Janardhan, that problem was a secondary issue, as Facebook’s DNS servers noted the loss of connection to the backbone and stopped advertising the BGP routing information that helps every computer on the internet find its servers. The DNS servers were still working, but they were unreachable.
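To make that chain of events concrete, here is a minimal toy sketch of the behavior Janardhan describes: a DNS node that withdraws its route announcement when it can no longer reach the backbone. Everything in it is an assumption for illustration (the `DnsNode` class, the `health_check` method, and the documentation-only prefixes); it is not Facebook's code and involves no real BGP software.

```python
from dataclasses import dataclass


@dataclass
class DnsNode:
    """Hypothetical DNS site that announces one prefix to the internet."""
    prefix: str
    advertised: bool = True  # True while the route is being announced

    def health_check(self, backbone_reachable: bool) -> None:
        # Assumed policy: a node that cannot reach the backbone treats itself
        # as unhealthy and withdraws its announcement; it re-announces once
        # the backbone is reachable again.
        self.advertised = backbone_reachable


def reachable_prefixes(nodes):
    """Prefixes the rest of the internet can still find a route to."""
    return sorted(n.prefix for n in nodes if n.advertised)


# Documentation-only address ranges, used here as placeholders.
nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]

# The backbone goes down everywhere at once, so every node withdraws.
for node in nodes:
    node.health_check(backbone_reachable=False)

print(reachable_prefixes(nodes))  # no routes left, even though the nodes still run
```

The point of the sketch is the last line: the nodes themselves are still running, but with every announcement withdrawn there is no route left by which anyone can reach them.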

The lack of network connections and loss of DNS cut off the servers from engineers trying to fix the issue and disabled many of the tools they normally use for repair and communication — just as we heard yesterday.

The blog post notes that the engineers had additional hurdles due to the physical and system security around this crucial hardware. Once they did “activate the secure access protocols” (this is apparently not a code word for “cut open the server door with an angle grinder”), they were able to get the backbone online and slowly restore services under gradually increasing loads. That’s part of the reason it took some people longer to get access back yesterday, as the power and computing demands of turning everything on at once might have caused more crashes.
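The idea of bringing traffic back in stages can be sketched in a few lines. The blog post does not say what schedule Facebook actually followed, so the starting percentage, the doubling factor, and the `ramp_schedule` name below are all assumptions chosen purely for illustration.

```python
def ramp_schedule(start_pct=10, factor=2, cap_pct=100):
    """Return the percentages of normal traffic to admit at each stage.

    Hypothetical policy: start small, multiply by `factor` at each stage,
    and finish at `cap_pct` so load never jumps from zero to full at once.
    """
    if start_pct <= 0 or factor <= 1:
        raise ValueError("start_pct must be positive and factor greater than 1")
    schedule = []
    pct = start_pct
    while pct < cap_pct:
        schedule.append(pct)
        pct *= factor
    schedule.append(cap_pct)  # final stage: full traffic
    return schedule


print(ramp_schedule())  # [10, 20, 40, 80, 100]
```

Under a schedule like this, users admitted in the later stages would regain access after those in the earlier ones, which is consistent with some people waiting longer than others.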

So that’s it. No conspiracy theories, and no techs taking axes to secure facilities to turn Mark Zuckerberg’s baby back on. Just a bug in a command that an audit tool missed, and for six hours, services that connect billions of people disappeared.

