Lightning Strikes Amazon's Cloud (Really)
The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit at the data center. The interruption lasted only around six hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."
typo (Score:3, Informative)
Should be: is NOT written out. Sorry about that.
Re:Inconcievable! (Score:5, Informative)
Well, it does seem like it was pretty resilient:
While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down.
So basically a set of servers went down, taking down the particular instances running on them. Customers were still able to take the exact same image and start new instances, apparently immediately. Now sure, it'd be nice if they worked out some kind of automatic clustering and failover to take care of this sort of thing for you, but when my server at my dedicated host goes down, I don't have the option of immediately starting up a new host with the exact same configuration.
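That relaunch is essentially a single API call against the same image. A minimal sketch using boto3; the AMI ID, region, and instance type here are placeholders for illustration, not details from the story:

```python
# Minimal sketch: relaunching a replacement instance from the same
# machine image (AMI) after the original instance dies.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical: the image the dead instance ran
    InstanceType="t2.micro",          # hypothetical instance type
    MinCount=1,
    MaxCount=1,
)
print("Replacement instance:", response["Instances"][0]["InstanceId"])
```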
Re:Do any of you know how they survived? (Score:4, Informative)
Typically, the RAID controller will have enough on-board capacity to clear its write cache before losing power entirely, while the drive array will be connected to a decent UPS that can hold for at least a few minutes. Meanwhile, the server itself will also likely be connected to the same UPS, or a different one.
The real question at hand is whether the UPSes were between the power distribution node and the servers, or on the other side of the distribution node, and therefore worthless in a case like this. I've seen both configurations, but the latter is rarer — not because of this particular case, but because of efficiency concerns.
If there was a failure of design, it was most likely in the building wiring itself. The building was clearly not properly grounded against lightning strikes; if it had been, the surge would never have hit the internal wiring. It might have kicked the building off the grid for a time, but it should never have reached a power distribution node. Although, since being kicked off the grid also means losing power, the outcome would likely have been similar if not identical.
Re:Lightning once striked our office building. (Score:5, Informative)
"Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios."
"you can protect your applications from failure of a single location"
Re:It evidently did (Score:3, Informative)
If you want to guarantee data integrity and consistency between your instances, then you cannot tolerate one out of two going down. Byzantine agreement protocols can tolerate fewer than one third of nodes failing, so you would actually need four instances to tolerate one failure.
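The bound behind that claim is n >= 3f + 1: to survive f Byzantine (arbitrary) failures you need at least 3f + 1 replicas. A quick sketch of the arithmetic:

```python
# Byzantine agreement bound: n >= 3f + 1 replicas are needed to
# tolerate f arbitrary failures, i.e. fewer than one third of the
# nodes may be faulty.
def min_replicas(f: int) -> int:
    """Minimum replica count for agreement despite f Byzantine faults."""
    return 3 * f + 1

for f in (1, 2, 3):
    print(f"to tolerate {f} failure(s), you need {min_replicas(f)} replicas")
# f=1 -> 4, f=2 -> 7, f=3 -> 10
```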
Re:Inconcievable! (Score:3, Informative)
EC2 instances don't hold persistent instance data. The GP is correct: state data is generally stored on S3, on shared storage, or through their database interface.
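For illustration, pushing state off the instance into S3 is a single call. A minimal boto3 sketch; the bucket name, key, and payload below are made up:

```python
# Minimal sketch: keeping application state off the (ephemeral)
# instance by checkpointing it to S3.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-app-state",           # hypothetical bucket
    Key="sessions/checkpoint.json",  # hypothetical key
    Body=b'{"last_seen": "2009-06-11T00:00:00Z"}',  # hypothetical payload
)
```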
Re:It evidently did (Score:3, Informative)
Only one of Amazon's two zones went down
There are two regions (US and EU), each with several availability zones (the US currently has four). The AZs are designed to be isolated from one another. This outage affected one AZ in the US region.
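You can enumerate a region's AZs yourself; a minimal boto3 sketch (the region name is just an example):

```python
# Minimal sketch: listing the availability zones in one region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```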
If you were load balancing across instances in multiple AZs (or even using Amazon's own Elastic Load Balancing and Auto Scaling features) you would have been fine, since this is exactly the kind of problem they're designed to handle.
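A minimal sketch of that idea, launching one instance in each of two zones so a single-AZ outage can't take out the whole service; the AMI ID and zone names are placeholders:

```python
# Minimal sketch: spreading capacity across two availability zones.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region
for az in ("us-east-1a", "us-east-1b"):  # hypothetical zone names
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical image
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
```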