Just as Facebook’s Antigone Davis was live on CNBC defending the company over a whistleblower’s accusations and its handling of research data suggesting Instagram is harmful to teens, its entire network of services suddenly went offline.
The outage started just before noon ET and took nearly six hours to resolve. It is the worst outage for Facebook since a 2019 incident took its site offline for more than 24 hours, and the downtime hit hardest the small businesses and creators who rely on these services for their income.
Facebook issued an explanation for the outage on Monday evening, saying that it was due to a configuration issue. On Tuesday afternoon, Facebook engineers offered more detail, explaining that the company’s backbone connection between data centers shut down during routine maintenance, which caused the DNS servers to go offline. These two factors combined to make the problem more difficult to fix, and they help explain why services were offline for so long.
Instagram.com was flashing a 5xx Server Error message, while the Facebook site merely told us that something went wrong. The problem also affected its virtual reality arm, Oculus. Users could load games they already had installed, and the browser worked, but social features and installing new games did not.
After failing all tests for most of Monday, a test of ISP DNS servers via DNSchecker.org showed most of them successfully finding a route to Facebook.com at 5:30PM ET. A few minutes later, we were able to start using Facebook and Instagram normally; however, it may take time for the DNS fixes to reach everyone.
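For readers who want to run a similar check themselves, here is a minimal sketch in Python. It only asks the local machine's configured resolver whether a hostname resolves, so it is a much narrower test than the many ISP servers that DNSchecker.org polls. The function name `can_resolve` and its `resolver` parameter are illustrative assumptions, not part of any tool mentioned above.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the resolver yields at least one address for hostname.

    `resolver` defaults to the standard library's socket.getaddrinfo; it is a
    parameter only so the check can be exercised without a network connection.
    """
    try:
        results = resolver(hostname, 443)
    except socket.gaierror:
        # Raised when the name cannot be resolved, for example when the
        # authoritative DNS servers for the domain are unreachable.
        return False
    return len(results) > 0
```

Calling `can_resolve("facebook.com")` during an outage like this one would be expected to return `False` once cached records expire, and `True` again after the DNS servers come back, though the timing depends on each resolver's cache.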
On Twitter, Facebook communications exec Andy Stone said, “We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.” Mike Schroepfer, who will step down from his post as CTO next year, tweeted, “We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible.”
Inside Facebook, the outage broke nearly all of the internal systems employees use to communicate and work. Several employees told The Verge they resorted to talking through their work-provided Outlook email accounts, though employees couldn’t receive emails from external addresses. Employees who were logged into work tools such as Google Docs and Zoom before the outage could still use those, but any employee who needed to log in with their work email was blocked.
On Monday we learned that Facebook engineers were sent to the company’s US data centers to try and fix the problem, according to two people familiar with the situation.
A peek at Down Detector (or your Twitter feed) reveals the problems were widespread. While it’s unclear exactly why the platforms were unreachable for so many people, their DNS records show that, like last week’s Slack outage, the problem is apparently DNS (it’s always DNS).
Cloudflare senior vice president Dane Knecht notes that Facebook’s routes under the Border Gateway Protocol (BGP), which helps networks pick the best path to deliver internet traffic, were suddenly “withdrawn from the internet.” While some have speculated about hackers, or an internal protest over the whistleblower testifying before Congress, Facebook has blamed the problem on a bug that occurred during routine maintenance.
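To illustrate what a withdrawn route means in practice, here is a toy sketch in Python. It is not how real BGP routers are implemented: the `RouteTable` class, the `peer-a` next hop, and the `192.0.2.0/24` prefix (a range reserved for documentation) are all assumptions made for the example. The point is only that once a prefix is withdrawn, a lookup for an address inside it finds no path at all.

```python
import ipaddress


class RouteTable:
    """Toy routing table: announced prefixes mapped to a next hop."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        # Record that traffic for this prefix can be sent via next_hop.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Remove the prefix; withdrawing an unknown prefix is a no-op.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Return the next hop for the most specific matching prefix,
        # or None when no announced prefix covers the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RouteTable()
table.announce("192.0.2.0/24", "peer-a")
before = table.lookup("192.0.2.53")   # a path exists
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")    # no path remains
```

In this sketch the address stands in for a DNS server: with its route gone, other networks have no way to reach it, which is consistent with the DNS failures described above.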
Update October 4th, 3:37PM ET: Added additional information about the outage.
Update October 4th, 4:15PM ET: Added statement from Facebook CTO Mike Schroepfer, along with internal Facebook updates.
Update October 4th, 5PM ET: Noted outage is still ongoing, added information about the 2019 outage.
Update October 4th, 5:35PM ET: DNS updates suggest Facebook is closing in on a solution.
Update October 4th, 6:08PM ET: Facebook.com is back online.
Update October 4th, 10:29PM ET: Added information about Facebook’s explanation.
Update October 5th, 2:29PM ET: Added more background details on the backbone network problem that caused the outage.