Lightning Strikes Amazon's Cloud (Really)
The Register has details on a recent EC2 outage being blamed on a lightning strike that zapped a power distribution unit at the data center. The interruption lasted only around six hours, but the irony should last much longer. "While Amazon was correcting the problem, it told customers they had the option of launching new server instances to replace those that went down. But customers were also able to wait for their original instances to come back up after power was restored to the hardware in question."
Who covers the cost? (Score:2, Interesting)
Naive question: Are data centers usually insured for the cost of hardware replacement and/or loss of revenue in a situation like this?
Lightning once struck our office building. (Score:3, Interesting)
I have to wonder whether those who are critical of Amazon here have ever experienced a direct lightning strike. I doubt it.
Do any of you know how they survived? (Score:5, Interesting)
Re:Inconceivable! (Score:2, Interesting)
I don't see how cloud hosting is somehow incompatible with hosting in two places.
Lightning once struck my friend's house. (Score:1, Interesting)
Re:Lightning once struck our office building. (Score:3, Interesting)
Three days is lucky. My very first job (many, many moons ago) was at a company that had a few 5-, 10-, and 15-meter SATCOM dishes outside. One fall night, a set of severe T-storms rolled through around 2 a.m., and lightning struck the SAT farm. It nearly knocked me out of my NOC chair, where I was fighting to stay awake, and I swore something big had exploded outside.
Turns out one of the SAT dishes had not been properly grounded, and the current surged through the SATNET into our internal networks. Several midrange systems, network gear, LAN PCs, modems, etc. were fried. So were the console terminals, and if I'd been typing instead of fighting sleep, I would have been crispy.
The next several days were spent replacing the instantly fried gear. But systems that initially seemed unaffected started showing serious glitches over the next few weeks. My guess is that Amazon may have this same problem.
Re:Lightning once struck our office building. (Score:3, Interesting)
I'm thinking critically because Amazon, EMC, VMware, etc. bill The Cloud as a mystical place where you throw your shit and then it's universally available 100% of the time. Nothing bad happens in The Cloud.
No, they don't. You're either being disingenuous or idiotic.
Per http://aws.amazon.com/ec2/#highlights [amazon.com], Amazon is promising: "Reliable: Amazon EC2 offers a highly reliable environment where replacement instances can be rapidly and predictably commissioned. The service runs within Amazon's proven network infrastructure and datacenters. The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region."
The irony here is that 6 hours of downtime in a year works out to 99.93% availability, so they've already blown the SLA for the year.
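For anyone who wants to check that math, here's a back-of-the-envelope sketch; it ignores how the SLA actually defines the measurement window and any exclusions:

    # Rough availability math for a 6-hour outage over one year.
    hours_per_year = 365 * 24               # 8760 hours
    outage_hours = 6
    availability = 1 - outage_hours / hours_per_year
    print(f"{availability:.4%}")            # ~99.9315%, under the 99.95% SLA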
So what's the deal with having all copies of these VMs in one datacenter? That's not very The Cloud of them.
If there's only one instance running, it's kinda hard to run it in multiple datacenters. They might be running clustering within a datacenter, but that can still be taken down by a power outage affecting multiple servers. As pointed out earlier, you can have instances in multiple datacenters (Availability Zones, as Amazon calls them) if you're willing to pay for it.
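As a rough illustration of the multi-zone idea (using the modern boto3 API, which postdates this outage; the AMI ID and zone names here are hypothetical, and you'd need AWS credentials configured):

    # Sketch: launch one instance in each of two Availability Zones
    # so a single-datacenter failure can't take both copies down.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for zone in ("us-east-1a", "us-east-1b"):   # hypothetical zones
        ec2.run_instances(
            ImageId="ami-12345678",             # hypothetical AMI
            InstanceType="t2.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )

You pay for both instances, of course, which is exactly the trade-off the parent post is describing.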
Re:Lightning once struck our office building. (Score:5, Interesting)
Just so people know, this can be a real bitch.
I took a direct lightning strike at one site I work with. It entered the corner of the building, traveled down the inside wall leaving a scorch mark on two levels, and went into the basement where all the servers and switches were located. The lightning then traveled through the electrical service mains to an encased transformer in the parking lot next door, causing it to explode with enough force that it shattered the windows of the bank building next door; a door panel was found on a roof about a block away. It appears that one half of the electrical system was grounded properly through a specific ground rod, while the other half was tied into the plumbing that ran inches away from the lightning rod grounds. When they purchased the building, they didn't redo all the electrical on the side that wasn't remodeled, and that way of grounding was normal at the time.
We lost 3 of the 5 servers instantly and couldn't keep the other two stable. Both switches were down, and 20 of the 44 workstations, along with the tape backup machine, copiers, and networked printers, were completely dead when we got there. The entire building had a lightning/surge protector with battery backup and a natural gas generator on the mains, so they weren't too concerned about in-house protection for specific equipment. Only the systems with a UPS directly attached survived, with the exception of the servers; I'm not sure whether they died from the lightning strike or from getting soaked by the fire sprinklers the strike set off (surprisingly, there was no fire).
It took five people two days at almost 20 hours a day, a lot of borrowing from other sites, about 20 trips to five or six computer stores in the surrounding counties, and a generator to come back online and be operational again. We even had a makeshift phone system in place while waiting for a new Avaya to come in. We did all this before the electric company got the transformer replaced and service back on. Until we replaced the other machines that were thought not to be affected, we experienced all sorts of weird behavior on the network, and I'm still not confident in the cabling even though it passed the testing [idealindustries.com]. Of course, I didn't run the certification, so it might just be me not trusting others.
If you get a direct strike, you might as well count on replacing everything in a production environment. When I say direct strike, I mean there's evidence it actually hit the building, not something that hit down the road and traveled to the building. It will be easier and cheaper in the long run. I now have, as part of the catastrophe plan, a means to replace every computer and component on the network at one time, just to be safe. If it weren't for two other sites having the same tape drives, we would have had to wait a week for a replacement to come in before starting the data recovery process. Thank god for off-site tape storage.
Speaking of lightning and electronics. (Score:3, Interesting)
I don't remember the final resolution of the problem, but I do remember that from the second strike until the problem was solved, every time I heard thunder I would run to the English building and, with my newly assigned key, run upstairs and disconnect the RJ-21 fan-out cables. I would then leave a note at the English department office informing them that they'd need to plug them back in in the a.m. One evening, I didn't make it. I heard thunder and bolted for the English department... I had my key in the building's outside door when lightning struck the building... and I knew I was too late. When I got upstairs, I could smell burnt electronics...
Probably around the same time this was going on for me, my dad, who was a large-scale CSE, had similar problems. I don't know how much 16-port line cards for the system he was supporting cost, but one day he had to replace eight or nine of them. The next day, UPS delivered two cases of copper-fiber-copper serial surge suppressors, and he scheduled their installation. I don't think that site had problems after that.