It's funny.  Laugh. Earth Hardware

Lightning Strikes Amazon's Cloud (Really)

The Register has details on a recent EC2 outage that is being blamed on a lightning strike that zapped a power distribution unit of the data center. The interruption only lasted around 6 hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."
This discussion has been archived. No new comments can be posted.

  • by mr_stinky_britches ( 926212 ) on Friday June 12, 2009 @11:50PM (#28317199) Homepage Journal
    **typo** should be: is NOT written out

    Sorry about that.
  • by ShadowRangerRIT ( 1301549 ) on Saturday June 13, 2009 @12:04AM (#28317261)
    UPS, or backup generator, or some other equivalent system that gives just enough power for a clean shut down. I've seen blades with built-in UPS (possibly not even a battery, just a capacitor) that exists solely to sync to disk in the event of a power loss.
  • Re:Inconcievable! (Score:5, Informative)

    by nine-times ( 778537 ) <nine.times@gmail.com> on Saturday June 13, 2009 @12:17AM (#28317307) Homepage

    Well it does seem like it was pretty resilient:

    While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down.

    So basically a set of servers went down, and it took down the particular instances running on those servers. Customers were still able to take the same exact image and start new instances-- it sounds like immediately. Now sure, it'd be nice if they worked out some kind of automatic clustering and failover to take care of this sort of thing for you, but when my server goes down with my dedicated host, I don't have the option to start up a new host immediately with the same exact configuration.

  • by KahabutDieDrake ( 1515139 ) on Saturday June 13, 2009 @12:43AM (#28317415)
    You've never actually worked with enterprise-class gear, have you? It's standard for most of the servers and all of the data storage to have capacitance/battery backups for just such an emergency.

    Typically, the RAID controller will have enough onboard capacity to clear its write cache before losing power entirely, while the drive array will be connected to a decent UPS that can hold for at least a few minutes. Meanwhile, the server itself will also likely be connected to the same UPS, or a different one.

    The real question at hand is, were the UPSes between the power distribution node and the server, or were they on the other side of the distribution node, and therefore worthless in a case like this? I've seen both configurations, but the latter is rarer. Not because of this particular case, but because of efficiency concerns.

    If there was a failure of design, it was most likely in the building wiring itself. The building was clearly not properly grounded against lightning strikes; if it had been, the surge would never have hit the internal wiring. It might have kicked the building off the grid for a time, but it should never have reached a power distribution node. That said, the outcome would likely have been similar, if not identical.
  • No, they don't. You're either being disingenuous, or idiotic.

    "Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from failure scenarios."

    "you can protect your applications from failure of a single location"

  • Re:It evidently did (Score:3, Informative)

    by kasperd ( 592156 ) on Saturday June 13, 2009 @01:42AM (#28317657) Homepage Journal

    Only one of Amazon's two zones went down so a well designed cloud app shouldn't have failed.

    If you want to guarantee data integrity and consistent data between your instances, then you cannot tolerate one out of two going down. Byzantine agreement protocols can tolerate less than one third failures, so you would actually need four to tolerate one failure.
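
    A small sketch of the arithmetic behind that bound, assuming the usual n >= 3f + 1 requirement for Byzantine agreement (the function names are made up for illustration):

```python
def min_replicas_byzantine(faults: int) -> int:
    """Smallest replica count n that tolerates `faults` Byzantine failures.

    Assumes the standard bound: fewer than one third of the replicas
    may be faulty, i.e. n >= 3 * faults + 1.
    """
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 3 * faults + 1


def max_byzantine_faults(replicas: int) -> int:
    """Largest number of Byzantine failures that `replicas` nodes can tolerate."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    # Largest f such that 3 * f < replicas.
    return (replicas - 1) // 3
```

    With two instances, max_byzantine_faults(2) is 0, which is the point being made: losing one out of two is not survivable under that model, and min_replicas_byzantine(1) is 4.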

  • by sirsnork ( 530512 ) on Saturday June 13, 2009 @03:08AM (#28317955)
    RAID controllers have batteries so they can remember what's in the cache (for about 48 hours), not so they can write that data out to the disks before they power off. When power is returned and the disks come back up, the cache is flushed before any other action, thereby keeping the array in one piece.
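
    Roughly, that behavior looks like the toy model below. It is only a sketch of the idea (the class and method names are invented, and no real controller firmware works at this level): pending writes sit in battery-backed memory through the outage and are replayed to disk first thing when power returns.

```python
class BatteryBackedController:
    """Toy model of a RAID controller with a battery-backed write cache."""

    def __init__(self):
        self.disk = {}       # block number -> data that has reached the disks
        self.cache = {}      # pending writes held in battery-backed memory
        self.powered = True

    def write(self, block, data):
        """Accept a write into the cache; it is not on disk yet."""
        if not self.powered:
            raise RuntimeError("controller has no power")
        self.cache[block] = data

    def power_loss(self):
        """Power drops: nothing is flushed, the battery just keeps the cache alive."""
        self.powered = False

    def power_restore(self):
        """Power returns: flush the retained cache before doing anything else."""
        self.disk.update(self.cache)
        self.cache.clear()
        self.powered = True
```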
  • Re:Inconcievable! (Score:3, Informative)

    by jcnnghm ( 538570 ) on Saturday June 13, 2009 @10:01AM (#28319555)

    EC2 instances don't contain instance data. The GP is correct. State data is generally stored on S3, on shared storage, or using their db interface.

  • Re:It evidently did (Score:3, Informative)

    by friedo ( 112163 ) on Saturday June 13, 2009 @10:31AM (#28319713) Homepage

    Only one of Amazon's two zones went down

    There are two regions (US and EU), each with several availability zones (US currently has four). The AZs are designed to be isolated from one another. This outage affected one AZ in the US region.

    If you are doing load balancing across instances in multiple AZs (or even using Amazon's own Elastic Load Balancing and Auto-Scaling features) you would have been fine, since this is exactly the kind of problem they're designed to handle.
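
The idea of balancing across availability zones can be sketched in a few lines. This is a hand-rolled illustration, not Amazon's Elastic Load Balancing API: the `Instance` type, the zone names, and the `route` function are all hypothetical, and a real setup would detect failed zones through health checks rather than take them as an argument.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Instance:
    name: str   # hypothetical instance label
    zone: str   # hypothetical availability zone label


def healthy_instances(instances, failed_zones):
    """Keep only the instances whose zone is not known to be down."""
    return [inst for inst in instances if inst.zone not in failed_zones]


def route(requests, instances, failed_zones):
    """Round-robin requests over the instances in the surviving zones."""
    pool = healthy_instances(instances, failed_zones)
    if not pool:
        raise RuntimeError("no healthy instances in any zone")
    targets = cycle(pool)
    return [(request, next(targets).name) for request in requests]
```

If one zone drops out, requests simply go to the instances in the remaining zones; only when every zone is down does routing fail.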
