[Feb. 17th, 2005 | 03:47 pm]
After the big Internap power failure recently, we no longer trust any storage product to work as advertised.
I wrote a program to test disks/RAIDs and Matthew's been running it, finding out that, indeed, disks and RAIDs lie.
The test works like this:
-- Matthew goes to storage vendor with his laptop and crossover-connects his laptop and the server to be tested.
-- Matthew runs the server half of my disk tester on his laptop.
-- Matthew then runs the spewer client on a raw(8)-ified disk partition. The client picks random 16kB-aligned offsets on the partition and picks a random 32-bit number, which it writes in hex (%08x) over a 16kB range. It reports to the spewserver both BEFORE and AFTER the disk write.
-- the server notes what the client said it was about to do and what it reported doing.
-- let it run for a while....
-- Pull the power...
-- server notices client hasn't sent anything in 3 seconds, quits, writing out a map of what 32-bit number pattern should be at each sector.
-- power on server
-- copy map file from the laptop (spewserver) to the server, run spewclient in verify mode. it dumps a histogram of errors per seconds-before-powerloss:
Histogram of seconds before end:
Well, the 3 seconds is really because the "end" is considered the time AFTER the 3-second timeout, so that's kinda a bug: the buckets should read 0, 1, 2 seconds before, not 3, 4, 5. But see how there are 31 regions that are bogus at t=0, 7 at t=-1, and 1 at t=-2?
That means something was lying, and we don't buy that hardware until we get it configured so it doesn't lie.
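For the curious, here is a rough Python sketch of the bookkeeping described above. It is not the actual tester. The function names (`pattern`, `write_block`, `build_map`, `verify`) and the event tuple layout are made up for illustration. It also makes two assumptions the post doesn't state: a region whose BEFORE was seen without a matching AFTER is dropped from the map (its contents could legitimately be either value), and "end" is taken as the time of the last message from the client, which sidesteps the 3-second offset mentioned above.

```python
import os
from collections import Counter

BLOCK = 16 * 1024  # 16kB regions, as in the post


def pattern(value):
    """One 16kB region: the 32-bit value as %08x, repeated to fill the block."""
    return (b"%08x" % value) * (BLOCK // 8)


def write_block(f, block, value):
    """Client side: write the pattern at a 16kB-aligned offset and push it out.

    The post uses a raw device; fsync on a regular file is a stand-in (assumption).
    """
    f.seek(block * BLOCK)
    f.write(pattern(value))
    f.flush()
    os.fsync(f.fileno())


def build_map(events):
    """Server side: turn the client's reports into an expected-contents map.

    events: iterable of (time, phase, block, value), phase is "before" or "after".
    Returns {block: (value, ack_time)} for regions whose last write was acknowledged.
    """
    expected = {}
    for t, phase, block, value in events:
        if phase == "before":
            # Write in flight: contents are unknown until the AFTER arrives.
            expected.pop(block, None)
        else:
            expected[block] = (value, t)
    return expected


def verify(read_block, expected, end_time):
    """Verify mode: count bogus regions per whole second before the end."""
    hist = Counter()
    for block, (value, ack_time) in expected.items():
        if read_block(block) != pattern(value):
            hist[int(end_time - ack_time)] += 1
    return hist
```

Any non-empty histogram means a write was acknowledged and then lost, which is exactly the "lying" the test is looking for.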