Amazon has given an explanation for the outage which brought down Netflix’s streaming service on Christmas Eve, saying all it took was one absent-minded developer.
The outage was caused by a break in service at Amazon Web Services, the company’s cloud platform, which began at around 12:24 pm PST on December 24.
And it was caused, says Amazon, by the deletion of some of its Elastic Load Balancing Service (ELB) state data, which is used to manage the configuration of the ELB load balancers in the region.
“The data was deleted by a maintenance process that was inadvertently run against the production ELB state data,” says the team.
“This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
At this point, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. But at first, the team couldn’t work out what was going on, as not all the APIs were failing: many customers were able to create and manage new load balancers, but not manage existing ones.
It wasn’t until 5:00 pm that the team realized what was going on and disabled several of the ELB control plane workflows, including the scaling and descaling workflows, to prevent additional running load balancers from being affected by the missing ELB state data.
But the problems continued. “The team attempted to restore the ELB state data to a point-in-time just before 12:24 PM PST on December 24th (just before the event began). By restoring the data to this time, we would be able to merge in events that happened after this point to create an accurate state for each ELB load balancer,” says the AWS team.
“Unfortunately, the initial method used by the team to restore the ELB state data consumed several hours and failed to provide a usable snapshot of the data. This delayed recovery until an alternate recovery process was found.”
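For readers unfamiliar with the approach the team describes, restoring a snapshot and then merging in later events can be sketched roughly as follows. This is a minimal illustration only, not Amazon's actual recovery process: the `Event` record, the `restore_state` function and the shape of the state data are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical change record for one load balancer."""
    timestamp: float
    balancer_id: str
    config: Optional[dict]  # None means the load balancer was deleted


def restore_state(snapshot, snapshot_time, events):
    """Rebuild current state from a point-in-time snapshot plus later events.

    snapshot maps balancer_id -> config as of snapshot_time (assumed layout).
    Events at or before snapshot_time are already reflected in the snapshot,
    so only later events are replayed, in timestamp order.
    """
    # Copy so the original snapshot is left untouched.
    state = {bid: dict(cfg) for bid, cfg in snapshot.items()}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp <= snapshot_time:
            continue
        if event.config is None:
            state.pop(event.balancer_id, None)
        else:
            state[event.balancer_id] = dict(event.config)
    return state


snapshot = {"lb-a": {"instances": 2}}
events = [
    Event(5, "lb-a", {"instances": 1}),   # predates the snapshot, skipped
    Event(20, "lb-b", {"instances": 3}),  # created after the snapshot
    Event(15, "lb-a", {"instances": 4}),  # updated after the snapshot
]
restored = restore_state(snapshot, 10, events)
```

The sketch only works if the snapshot itself is usable, which is exactly the step that Amazon says failed on its first attempt.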
Amazon is now attempting to reassure users that it can avoid this sort of disruption in future, most notably by modifying the access controls on its production ELB state data to prevent modification without specific Change Management (CM) approval.
It’s also modified its data recovery process and says it’s confident that it could recover ELB state data much more quickly if anything similar happens again. But this will be small comfort to users who must be wondering how safe it can ever be to rely so totally on a third party for their service.