Amazon Web Services (AWS) has provided an explanation as to what caused the outage that downed parts of its own services, as well as the third-party websites and online platforms that utilize AWS. In a post on the AWS website, the company explains that an automated process caused the outage, which began around 10:30AM ET in the Northern Virginia (US-EAST-1) region.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” Amazon’s report says. “This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”
According to the report, this issue even impacted Amazon’s ability to see what exactly was going wrong with the system. The surge in congestion prevented the company’s operations team from using the real-time monitoring system and internal controls that they typically rely on, explaining why the outage took so long to fix. Amazon notes that service didn’t start improving until 4:34PM ET, and that the issue was fully resolved at 5:22PM ET.
Since Amazon’s Support Contact Center also runs on the AWS network, customers weren’t able to create support cases for seven hours during the outage. Amazon’s Service Health Dashboard, which the platform uses to provide status updates, was also impacted, resulting in Amazon’s delayed acknowledgment of the issue. The company says that it’s working on a way to improve its response to outages, and plans on releasing a revamped version of the Service Health Dashboard that should help customers receive timely updates if an outage occurs.