Powerloss [May. 17th, 2004|07:51 pm]
Brad Fitzpatrick
So I guess we lost an entire cabinet (one of our 4) at Internap because power cut out. 20 servers down, but the site's still running. (Except for users on one cluster, because both masters were in the same cluster.... need to fix that!)

I just got back from dinner and Lisa's on it, but I think we lost:

2 storage machines (not yet used)
2 databases
16 web nodes

Dear Internap, not a good gesture right when we're back in the middle of contract negotiations.

So I guess it's our fault since we tripped a circuit breaker and weren't tracking how much power we were using. But how should we know? We had no means of monitoring it, and servers don't exactly say anywhere how much they're using. Guess this was our lesson learned. Time to buy some power meters and stuff.

From: tjousk
2004-05-17 08:17 pm (UTC)
You're doing rather well to have the site stay mostly useable when something like that happens.
Being as I'm on the madcow cluster, I didn't expect to be able to comment/view my friends page at all, but I guess that's because some stuff is on the master cluster...

Good luck on a prompt and appropriate fix.
From: ydna
2004-05-17 08:59 pm (UTC)
"Oops. Sorry about the backhoe there, fellas. I didn't see your power conduit there."
From: krellis
2004-05-17 09:00 pm (UTC)
Sounds like when they cut out power to the entire A feed (half of the entire datacenter) at Internap Boston because an idiot technician fucked up when performing UPS maintenance and EPO'd the whole thing. Hence my earlier recommendation to avoid Internap like the plague :)
From: lisa
2004-05-17 11:08 pm (UTC)
or not.
From: erik
2004-05-17 11:43 pm (UTC)
At my office we were getting dangerously close to overloading one of the circuits in our server room, and the building maintenance people (who are totally awesome and knowledgeable and instantly responsive) came up with a computer-generated drawing that shows exactly which circuit is which and has comprehensive info on how much each circuit was using.

Granted it'd be nice if we could directly access all of that information, but I was very impressed with how quickly they responded to the problem. We moved some servers' power cords around and all was well again.
From: scsi
2004-05-18 12:17 am (UTC)
I just found out I'm pulling 16 amps out of an 18 amp circuit in the DJ rack.. :( That means I have 1/4 rack that I'm paying for, but can't plug anything in because I'll level the entire cabinet... If I cold boot one of my DB servers, the draw from all the drives spinning up will probably level the cabinet.

I feel your pain. I only know because I work there.. Or else I would have no clue.. The power gods are frowning upon us.
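
For anyone wanting to run the same numbers on their own rack, here is a rough sketch of the arithmetic in Python. The 80% sustained-load figure is the rule of thumb quoted elsewhere in this thread, not a spec for any particular breaker, and the function names are made up for illustration.

```python
def utilization(draw_amps, rated_amps):
    """Fraction of the circuit's rated amperage currently in use."""
    return draw_amps / rated_amps


def sustained_headroom_amps(draw_amps, rated_amps, sustained_fraction=0.80):
    """Amps left before hitting the assumed sustained-load limit.

    A negative result means the circuit is already past that limit.
    sustained_fraction=0.80 is an assumed rule of thumb; check the
    actual breaker rating with your facility.
    """
    return rated_amps * sustained_fraction - draw_amps


if __name__ == "__main__":
    draw, rated = 16.0, 18.0  # the figures from the comment above
    print(f"utilization: {utilization(draw, rated):.1%}")
    print(f"headroom vs. sustained limit: {sustained_headroom_amps(draw, rated):+.1f} A")
```

With 16 amps on an 18 amp circuit the utilization comes out just under 89%, which is already above an 80% sustained limit, so the headroom figure is negative.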
From: mendel
2004-05-18 06:28 am (UTC)
Ow, that sucks hard. You've got me nervous now, though: did that rack only have one circuit, or did you take out two, or did you expect two and turn out to have only one?

And be sure to take it easy on Internap come renegotiation -- maintenance on the runway lights isn't cheap, you know ;-)
From: spottman
2004-05-18 07:11 am (UTC)
Power poles by RPC (baytechdcd.com). They give you true RMS current, so not only do you see the inbound current to your machines but you also see the return current on the neutral from all of the switching going on in your power supplies.

Don't underestimate the ability of power supplies frantically trying to switch from a sine wave to DC. I have seen some racks push 1-2 Amps BACK to the circuit (CCCHEAP power supplies).

So yeah, hook up machines to RPC units, hook up the power poles to a $100 eBay terminal server, write a quick Perl script to log in and screen scrape the RMS current, set up mrtg/nagios/cricket/whatever to monitor/graph, and voilà, you have a monitorable power system. Set it to alarm you whenever you are drawing more than 75% of your rated amperage. You can go to 80% and hold it forever (depending on the grade of circuit breaker) and you can spike to 100% if you really want to chance it!
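
The scrape-and-alarm step could look something like the sketch below. It is in Python rather than Perl, and it only covers parsing a captured status screen and classifying the reading; the login/terminal-server part is left out. The screen text format, the regex, and the function names are all hypothetical, since the real RPC output will look different. The 75% and 80% thresholds are the ones described above.

```python
import re

# Hypothetical status screen; a real RPC unit's output will differ.
SAMPLE_SCREEN = """\
Outlet strip A
True RMS Current: 13.7 Amps
"""

# Assumed line format for the reading; adjust to whatever the unit prints.
CURRENT_RE = re.compile(r"True RMS Current:\s*([0-9]+(?:\.[0-9]+)?)\s*Amps")


def parse_rms_amps(screen_text):
    """Pull the RMS current reading (in amps) out of scraped screen text."""
    match = CURRENT_RE.search(screen_text)
    if match is None:
        raise ValueError("no RMS current reading found")
    return float(match.group(1))


def classify_load(amps, rated_amps, warn_fraction=0.75, sustained_fraction=0.80):
    """Map a reading onto alarm levels for a circuit of the given rating."""
    fraction = amps / rated_amps
    if fraction >= 1.0:
        return "over"       # at or past the breaker rating
    if fraction > sustained_fraction:
        return "critical"   # above what can be held indefinitely
    if fraction > warn_fraction:
        return "warning"    # past the alarm threshold
    return "ok"


if __name__ == "__main__":
    amps = parse_rms_amps(SAMPLE_SCREEN)
    print(amps, classify_load(amps, rated_amps=20.0))
```

The returned level is just a string, so it can be mapped onto whatever exit codes or values the monitoring tool of choice expects.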

good luck :)
From: brad
2004-05-18 08:16 am (UTC)
Thanks for the link/info!
From: alachicky
2004-05-18 07:29 am (UTC)
Hopefully you can get it fixed soon. And hey, a lesson learned is experience gained :)
From: travisd
2004-05-18 08:17 am (UTC)
Time to look for something like this.

Power seems to be the biggest problem now. It used to be space, but we're seeing power as the limiting factor in all of our datacenters. And with power comes cooling, of course.

From: mucho_suerte
2004-05-18 09:16 am (UTC)
At least you get to find out exactly what happened, and maybe make decisions around what you know. Our ISPs and host centers are the ultimate Bermuda Triangles - you know there was a problem, but heaven forbid that they reveal what exactly happened. Thanks for sharing what you found out - enlightening learning experience all around.
