Servers powering Amazon Web Services went down last Friday evening, causing outages for Netflix, Instagram, and Pinterest, and now the company has issued a thorough explanation of exactly what happened. An electrical spike in the company's Northern Virginia data centers was the initial trigger, but it was a glitch in Amazon's backup power systems that ultimately caused the servers to go dark and take down some of the web's biggest services.
The company uses a two-tiered approach to backup power: uninterruptible power supplies, commonly known as UPS batteries, and large electrical generators. All but one data center successfully transitioned to generator power, but the one that did not continued on UPS power for as long as it could. Eventually this power gave out, creating a 10-minute blackout of a section of Amazon's servers. These 10 minutes were only the beginning of Amazon's problems, as it still took several hours to reboot the servers and to verify that their filesystem integrity hadn't been compromised. Even after Amazon got its servers back online, Netflix and Instagram remained in the dark until sometime Saturday. While Amazon's outage was clearly a major issue, it appears that other factors also delayed service restoration.
Amazon's explanation of the event is less of an apology to end users and more of an attempt to restore confidence among its enterprise customers. Uptime is serious business in the server industry, and Amazon's detailed description of its backup power and load balancing mechanisms shows that it takes this metric very seriously, especially with Google Compute Engine now seemingly looking to tackle the same market.