Amazon explains outage that took out a large chunk of the internet | Engadget

Amazon has explained the AWS outage that took parts of the internet offline for several hours on December 7, and has promised more clarity if this happens again. As CNBC reports, Amazon revealed that an automated capacity scaling feature caused “unexpected behavior” from clients on the internal network. The devices connecting that internal network to AWS were overwhelmed, stalling communications.

The nature of the failure hampered teams' efforts to identify and fix the problem, Amazon added. They had to rely on logs to find out what happened, and internal tools were affected as well. Engineers were “extremely deliberate” in restoring service to avoid breaking still-functional workloads, and they had to contend with a “latent problem” that prevented network clients from backing off and giving systems a chance to recover.

The AWS division has temporarily disabled the scaling activity that caused the issue and will not re-enable it until fixes are in place. A fix for the latent flaw will arrive within two weeks, Amazon said. There is also an additional network configuration to protect devices in the event of a repeat failure.

Crises may be easier to understand next time. A new version of the AWS Service Health Dashboard will be released in early 2022 to provide a clearer view of any outages, and a multi-region support system will help Amazon get in touch with customers much sooner. These won’t bring AWS back any faster during an incident, but they can remove some of the mystery when services go down, which matters when the affected services range from Disney+ to Roomba vacuums.
