brad's life - Unexpected failure mode
Brad Fitzpatrick

[ website | bradfitz.com ]

Unexpected failure mode [Jul. 5th, 2007|10:56 pm]
[Current Location |Grass Valley, CA]

Dear Lazyweb,

Let's imagine for a second, hypothetically, that some drunk girl sees Blinkenlights on the RAID array in your garage and says (reportedly) "Oh neat, hot swap!", and proceeds to remove 3 drives from a particular 5-disk RAID-5 array, instantly killing the party music and quite a bit of other data [accessibility].

Hypothetically, that would look like this: (after a reboot, incidentally)
# cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
md1 : inactive sda[0] sde[4] sdd[3] sdc[2] sdb[1]
      2441932480 blocks super non-persistent
       
md0 : active raid1 sdf1[0] sdg1[1]
      97659008 blocks [2/2] [UU]
      
unused devices: <none>

Since I'm currently in Grass Valley for nick's wedding, enjoying the ~110F heat (it was 113F/45C on the way up here!), I'll leave it to you guys, my friendly LazyWeb, to suggest how one might fix such a busted array, perhaps saving me some time reading mdadm(8).

Lovingly yours,
Brad

Comments:
From: scsi
2007-07-06 06:40 am (UTC)

Maybe try:

mdadm -A --force /dev/md0

That should sync the superblocks, assuming that's what's wrong. Don't hate me for hosing your array if this fails.
From: scsi
2007-07-06 06:41 am (UTC)

er.. /dev/md1

You get the idea.
From: foobarbazbax
2007-07-06 08:07 am (UTC)

fun fact

Grass Valley is in Nevada County, which is shaped like a revolver pointed at Nevada because those bastards ripped off their name.
From: brad
2007-07-06 04:04 pm (UTC)

Re: fun fact

Hah, nice.
(Deleted comment)
From: supersat
2007-07-06 08:22 am (UTC)

Drunk people? :)
(Deleted comment)
From: terrajen
2007-07-09 03:13 am (UTC)

no, drunk people don't pull shit like that. i'm an expert on the subject.
From: dan_lane
2007-07-06 09:53 am (UTC)

hypothetically I'd be looking for a place to dispose of said drunk girl's body.
From: crschmidt
2007-07-06 11:54 am (UTC)

I was going to say that "Hypothetically, I'd have lost myself a party guest for the future." Your solution is slightly more dramatic.
From: flipzagging
2007-07-07 04:32 am (UTC)

If only there was someone who was both a body disposal expert and a Linux filesystem guru!

Oh wait.
From: hoist2k
2007-07-09 07:25 pm (UTC)

That was the funniest blog comment I've read in 2007. Genius!
From: brad
2007-07-09 07:29 pm (UTC)

hah
From: byron
2007-07-06 01:49 pm (UTC)

Hypothetically my first question would be the best way to dispose of a body followed up by a question about RAID repair.
From: evan
2007-07-06 03:14 pm (UTC)

From: matthew
2007-07-06 04:13 pm (UTC)

The howto makes sense, but I'd make one small change since this is a cascade failure and not simultaneous. I'd look at syslog to find which drive went offline first. Take the rest of the drives and force them back into a broken array with mdadm --assemble --force.

In theory this will put your array back to the state before the 2nd drive was yanked and your data will still be there (since the array stops once you lose 2 disks)

After that, add the last disk back in as a spare and rebuild.

ymmv
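
A rough sketch of those three steps, written as a dry run that only prints the mdadm commands instead of executing them. The helper name plan_recovery is made up for illustration, and /dev/sdc as the drive that dropped out first is purely an assumption; the real answer has to come from syslog. The member list is taken from the /proc/mdstat output in the post.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands, runs nothing.
# Usage: plan_recovery ARRAY FIRST_FAILED MEMBER...
plan_recovery() {
    array=$1
    first_failed=$2
    shift 2

    # Every member except the drive that went offline first.
    survivors=""
    for dev in "$@"; do
        [ "$dev" = "$first_failed" ] && continue
        survivors="$survivors $dev"
    done

    # 1. Release the inactive array so it can be reassembled.
    echo "mdadm --stop $array"
    # 2. Force the remaining drives back into a degraded array.
    echo "mdadm --assemble --force $array$survivors"
    # 3. Re-add the stale drive so the array rebuilds onto it.
    echo "mdadm $array --add $first_failed"
}

# ASSUMPTION: /dev/sdc was pulled first. Check syslog before trusting this.
plan_recovery /dev/md1 /dev/sdc /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Read the printed commands over before running any of them by hand. Depending on the mdadm version, the assemble step may also need --run to start the array degraded.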
From: dakus
2007-07-06 03:18 pm (UTC)

the raid was in the garage? seems like the heat would get to it?
From: brad
2007-07-06 04:05 pm (UTC)

a) it's SF... brrr
b) underground
From: henry
2007-07-06 10:05 pm (UTC)

just wrap that sucker in thermite cord and keep one ear to the scanner, all they'll get is a pile of molten metal. fuckin pigs!
From: rapidpacket.com
2007-07-06 09:16 pm (UTC)

I think you have just sold me on getting lockable trays. This is the second time I've heard about this routine.
From: wcu
2007-07-08 06:57 am (UTC)

ive seen her before, shes hot. party time!
From: ywong
2007-07-09 12:13 am (UTC)

Some people, when they have a complex technical problem to solve, think "I know, I'll ask for advice on LiveJournal!"

Now they have two problems.
From: brad
2007-07-09 07:30 pm (UTC)

Hah. nice jwz quote variant.
From: ywong
2007-07-10 04:53 am (UTC)

My wife and I say that a lot when people talk about relationship problems, since asking for advice on LiveJournal (or rather, the internet) rarely leads to real solutions so much as it more often leads to HILARIOUS drama. It is slightly more useful for technical advice.
From: askbjoernhansen
2007-07-09 05:56 pm (UTC)

don't let it rebuild

mdadm -E /dev/sd... is your friend. Go through each drive and you can figure out when they were "lost". This will be helpful. Watch out for drives that have been out for much longer than the rest.

Avoid having it rebuild.

Make sure you have a recent-ish mdadm and kernel - this stuff gets "I can't believe it worked before" fixes and improvements a lot.

Either force assemble (I think that might make it rebuild the last one) or you can _create the md device again!_ (as long as you keep the options the same). This sounds (and might be) crazy scary, but I've done it a few times to force it to go when I had one raid-5 failure and Linux decided that a second drive maybe was tired too. Yes, I rarely use raid-5 now.


- ask
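
A minimal sketch of the per-drive check, assuming mdadm --examine prints its usual "Field : value" lines, including "Events" and "Update Time". The helper names summarize_examine and compare_members are hypothetical, and the device list is just the five members shown in the /proc/mdstat output in the post.

```shell
#!/bin/sh
# Read `mdadm --examine` text on stdin and print "<events><TAB><update time>".
# ASSUMPTION: the superblock dump uses "Events : ..." and "Update Time : ..."
# lines; adjust the patterns if your mdadm prints them differently.
summarize_examine() {
    awk -F' : ' '
        /^ *Update Time :/ { utime = $2 }
        /^ *Events :/      { events = $2 }
        END { printf "%s\t%s\n", events, utime }
    '
}

# Print one line per member: device, event count, last superblock update.
# A drive with a much lower event count or a much older update time is the
# one that has been out of the array the longest.
compare_members() {
    for dev in "$@"; do
        printf '%s\t' "$dev"
        mdadm --examine "$dev" | summarize_examine
    done
}

# Example call (needs root, only reads the superblocks):
# compare_members /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Nothing here writes to the disks, so it is safe to run before deciding between a forced assemble and re-creating the md device.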