How a typo took down S3, the backbone of the internet

Earlier this week, much of the internet ground to a halt when the servers that power it suddenly vanished. The servers were part of S3, Amazon’s popular cloud storage service, and when they went down they took several big services with them. Quora, Trello, and IFTTT were among the sites affected by the disruption. The servers came back online more than four hours later, but not before totally ruining the UK celebration of AWSome Day.

Now we know how it happened. In a note posted to customers today, Amazon revealed the cause of the problem: a typo.

On Tuesday morning, members of the S3 team were debugging the billing system. As part of that, the team needed to take a small number of servers offline. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said. “The servers that were inadvertently removed supported two other S3 subsystems.”

The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without that subsystem, services that depend on S3 couldn’t perform basic data retrieval and storage tasks.
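A toy sketch helps show why losing such an index is so disruptive. The class, names, and behavior below are illustrative only, not Amazon’s actual design: the point is that when the subsystem mapping object keys to storage locations goes offline, reads and writes can’t be routed even though the underlying data is untouched.

```python
class MetadataIndex:
    """Toy index mapping object keys to the servers that hold their data."""

    def __init__(self):
        self._locations = {}  # key -> list of server IDs
        self.online = True    # flips to False when the subsystem's servers are removed

    def put(self, key, servers):
        if not self.online:
            raise RuntimeError("index offline: cannot accept writes")
        self._locations[key] = list(servers)

    def locate(self, key):
        if not self.online:
            raise RuntimeError("index offline: cannot route reads")
        return self._locations[key]


index = MetadataIndex()
index.put("photos/cat.jpg", ["server-a", "server-b"])
print(index.locate("photos/cat.jpg"))  # routing works while the index is up

index.online = False  # the index's servers are inadvertently taken offline
try:
    index.locate("photos/cat.jpg")
except RuntimeError as err:
    print(err)  # retrieval fails even though the object data still exists
```

The object itself never disappears; only the ability to find it does, which matches Amazon’s description of why basic GET and PUT operations stopped working.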

Once the servers were accidentally taken offline, the affected subsystems had to do “a full restart,” which apparently takes longer than it does on your laptop. While S3 was down, a variety of other Amazon web services stopped functioning properly, including Amazon’s Elastic Compute Cloud (EC2), which is also popular with internet companies that need to rapidly expand their computing capacity.

Amazon said S3 was designed to be able to handle losing a few servers. What it had more trouble handling was the massive restart. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” the company said.

As a result, Amazon said it is making changes to S3 to enable its systems to recover more quickly. It’s also declaring war on typos. In the future, the company said, engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
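That safeguard can be sketched in a few lines. This is a hypothetical illustration, not Amazon’s tooling: the function name, the capacity floor, and the numbers are all assumptions, but the idea is the same — reject any removal request that would push a subsystem below its minimum capacity, so a mistyped number fails loudly instead of executing.

```python
MIN_CAPACITY = 100  # assumed minimum number of servers a subsystem must keep


def remove_capacity(current_servers: int, requested: int) -> int:
    """Return the number of servers safe to remove, or raise if the
    request would breach the capacity floor."""
    if requested < 0:
        raise ValueError("requested removal must be non-negative")
    removable = max(0, current_servers - MIN_CAPACITY)
    if requested > removable:
        raise ValueError(
            f"refusing to remove {requested} servers: only {removable} can "
            f"go without dropping below the floor of {MIN_CAPACITY}"
        )
    return requested
```

With 120 servers and a floor of 100, removing 15 is allowed, but a fat-fingered request for 50 raises an error instead of quietly taking down two subsystems.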

It’s also making a change to the AWS Service Health Dashboard. During the outage, the dashboard embarrassingly showed all services running green, because the dashboard itself was dependent on S3. The next time S3 goes down, the dashboard should function properly, the company said.

“We want to apologize for the impact this event caused for our customers,” the company said. “We will do everything we can to learn from this event and use it to improve our availability even further.”


Comments

AWS is great.

Thing is accidents happen, they are re-mediating the issues and as such it will become more robust than before. Good on them.

Yup. I sort of pity the poor engineer – Jeff Bezos likely lost it that day.

It’s insanely hard to correct for human error that has never occurred at a previous level of scale. Well done Amazon overall.

We learn through mistakes.

We don’t learn through successes. Which is why it’s so weird when people give talks based on their successes.

I got here by making a ton of mistakes, ha!

Point of clarification: EC2 wasn’t down, but you couldn’t start new EC2 servers as the images that serve as the base operating system are stored on S3.

Any running EC2 servers were operational, assuming you didn’t need more or shut any down.

During the outage, the dashboard embarrassingly showed all services running green, because the dashboard itself was dependent on S3.

The shut down must’ve turned off all the fences. Damn it, even Nedry knew better than to mess with the raptor fences.

Haha, so I wasn’t the only one getting JP vibes off of this! I keep imagining Andy Jassy and some IT guy arguing about shutting down the systems

While S3 did go down, what this really showed us is how many developers/engineering organizations foolishly put all their eggs in a single DC, single cloud basket.

If you’re not doing multi-dc, and even better, multi-cloud, you’ll be susceptible to this regardless of being on Azure, GCE, or AWS.

And all of a sudden the cost advantage of the cloud is gone. But sure, you’d have a more reliable solution than if you depended on a single location/single cloud provider. And a lot more reliable than on-prem hosting!

For some users, the extra cost of redundancy is not worth avoiding a couple of hours of downtime per year on average. But they may rethink their cost-benefit calculations after this incident.

Multiple cloud providers is not a problem from a cost-benefit basis if you set them up correctly. Having an account with a cold, up-to-date VM costs next to nothing because it’s not running – if your site goes down, you spin up the alternative.
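The cold-standby pattern that commenter describes can be sketched simply. Everything here is a stand-in: the health check and the VM-start call would come from your provider’s SDK, and the function names are hypothetical. The key property is that the standby is started (and paid for) only when the primary stops responding.

```python
from typing import Callable


def failover_if_needed(
    primary_healthy: Callable[[], bool],
    start_standby_vm: Callable[[], None],
    standby_running: bool = False,
) -> bool:
    """Start the cold standby only when the primary is down.

    `primary_healthy` stands in for an HTTP health check against the
    primary site; `start_standby_vm` stands in for the provider API call
    that boots the cold VM. Returns whether the standby is running.
    """
    if standby_running:
        return True  # already failed over; nothing to do
    if not primary_healthy():
        start_standby_vm()  # boot the cold VM only on failure
        return True
    return False  # primary is fine; standby stays cold (and cheap)
```

Run on a schedule, this keeps the steady-state cost at roughly one stopped VM image, while still giving you somewhere to fail over to when your primary provider has a bad day.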

Snapchat, for instance, uses AWS and Google (comes to mind on account of it being in the news lately)

Welp. At least one Amazon engineer’s next performance review is not going to go so well.

Human errors are inevitable, the only sin is not learning from them.

This is the wrong attitude for a manager to take.

When someone makes a mistake, if they admit it, learn from it and become a stronger employee, then why punish them? It’s a costly learning experience, but it highlighted some faults in the system, not a fault in the individual. Imagine if this hadn’t happened for another 2 or 3 years, AWS would be far larger than it is now, the system could have been down for days instead of hours. This was a good thing.

But then, you’re probably not a manager of people. Co-workers should absolutely give him heck about this. Forever. Hey Bob, remember when you crashed the internet?

Hey, S3 didn’t go down. It was experiencing “increased error rates.”

"Alternate-Stats"

I wish I could recommend this a hundred times.

The backbone of the internet? Really? Sure there are some useful services there, but S3 doesn’t even come close to qualifying as the backbone. Maybe it’s a chunk of your kidney, kinda painful when it dies, but you can totally live without it.

Ench, that last sentence was kinda poetic. Hmmm…. I like your style kid. I like your style…
