Remember my disk-checker program I wrote about before? I'd never released it because it was too hard to use, but now it's dead simple, so here it is:
Run it and be amazed how much your disks/raid/OS lie. ("lie" = an fsync doesn't work)
It seems everything from PATA consumer disks to high-end server-class SCSI disks lie like crazy. Yes, that includes SATA there in the middle. I'll discuss fixing your storage components in a second.
In a nutshell, run it like this:
Tester machine (machine that won't crash):
$ diskchecker.pl -l
And then just let it chill. (the el is for listen). This program will listen (on port 5400 if no number follows -l) and will write one tiny file per host to /tmp/. It can be run as any user.
Machine being tested (machine you're going to pull the power cable on)
$ diskchecker.pl -s TESTERMACHINE create test_file 500
That creates a 500 MB file named "test_file" and it reports everything it's about to do and does to the TESTERMACHINE (which can be an IP or hostname).
Now, pull the power cable on the machine being tested. Don't turn it off nicely. Don't just control-C the program. Wait a couple seconds and plug your testee machine back in and reboot it. When it's back up, do:
$ diskchecker.pl -s TESTERMACHINE verify test_file
If the server process is still running, the machine you just killed will connect to the server and information about what's supposed to be where. The client will then verify it and produce an error report.
What you should see is:
Total errors: 0.
But you probably won't. You'll probably see an error count and a histogram of errors per seconds-before-crash. Most RAID cards lie (especially LSI ones), some OSes lie (rare), and most disks lie (doesn't matter how expensive or cheap they are). They lie because their competitors do and they figure it's more important to look competitive because the magazines only print speed numbers, not reliability stats. They must figure people who care about their disks working know how to test/fix their disks.
Ways to maybe fix your disk:
hdparm -W 0 /dev/hda -- worked on a crap office PATA disk (and it failed otherwise)
scsirastools -- need this on lots of SCSI disks. you'll probably have to remove your SCSI disks from your RAID card and fix the disks directly, since RAID cards very often won't disable it for you
Anybody have anything else to add?