Brad Fitzpatrick (brad) wrote,

Drive tester

After the big Internap power failure recently, we no longer trust any storage product to work as advertised.

I wrote a program to test disks/RAIDs and Matthew's been running it, finding out that, indeed, disks and RAIDs lie.

The test works like this:

-- Matthew goes to storage vendor with his laptop and crossover-connects his laptop and the server to be tested.

-- Matthew runs the server half of my disk tester on his laptop.

-- Matthew then runs the spewer client on a raw(8)-ified disk partition. The client picks a random 16kB-aligned offset on the partition and a random 32-bit number, which it writes in hex (%08x) over a 16kB range. It reports to the spewserver both BEFORE and AFTER each disk write.

-- the server notes what the client said it was about to do and what it reported doing.

-- let it run for a while....

-- Pull the power...

-- server notices client hasn't sent anything in 3 seconds, quits, writing out a map of what 32-bit number pattern should be at each sector.

-- power on server

-- copy the map file from the laptop (spewserver) to the server, run spewclient in verify mode. It dumps a histogram of errors per seconds-before-powerloss:

Histogram of seconds before end:
3 31
4 7
5 1
65 1

Well, the 3 seconds is really because the "end" is considered time AFTER the 3 second timeout, so that's kinda a bug. That should read 0,1,2 seconds before, not 3,4,5. But see how there are 31 regions that are bogus at t=0, 7 at t=-1, and 1 at t=-2?
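To make the off-by-three fix concrete, here is a rough Python sketch of how the histogram could be bucketed relative to the client's last report instead of the moment the server gave up. The function name, the arguments, and the idea that each bad region carries a write timestamp are assumptions for illustration; the post doesn't show the real map file format.

```python
from collections import Counter

TIMEOUT_SECONDS = 3  # the server's idle timeout mentioned in the post


def error_histogram(bad_write_times, end_time, timeout=TIMEOUT_SECONDS):
    """Count bad regions by whole seconds before the client's last report.

    bad_write_times: timestamps (seconds) of writes that failed verification.
    end_time: when the server quit, i.e. last report time plus the timeout.
    Both are hypothetical inputs; the real tool's data layout is not shown.
    """
    # Subtract the timeout so that bucket 0 means "right at power loss".
    last_report = end_time - timeout
    hist = Counter()
    for t in bad_write_times:
        hist[int(last_report - t)] += 1
    return dict(sorted(hist.items()))
```

Fed timestamps matching the run above, the buckets come out as 0, 1, 2 and 62 rather than 3, 4, 5 and 65.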

That means something was lying, and we don't buy that hardware until we get it configured so it doesn't lie.
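For the curious, the write pattern and the verify pass described in the steps above can be sketched in a few lines of Python. This is not the actual tester: the function names are made up, the raw-device I/O and the network reporting are left out, and it assumes the %08x string is simply repeated to fill the 16kB range.

```python
import random

BLOCK_SIZE = 16 * 1024  # 16kB regions, as described in the post


def make_block(value):
    """Fill one 16kB block with a 32-bit number in hex (%08x), repeated.

    Repeating the 8-byte string is an assumption about how the range is filled.
    """
    return (b"%08x" % value) * (BLOCK_SIZE // 8)


def pick_write(partition_size, rng=random):
    """Pick a random 16kB-aligned offset and a random 32-bit value."""
    offset = rng.randrange(partition_size // BLOCK_SIZE) * BLOCK_SIZE
    value = rng.getrandbits(32)
    return offset, value


def verify(read_block, expected):
    """Return offsets whose contents differ from the server's map.

    read_block: hypothetical callable, offset -> 16kB of bytes read from disk.
    expected: dict of offset -> 32-bit value the server recorded as written.
    """
    return [off for off, value in sorted(expected.items())
            if read_block(off) != make_block(value)]
```

Any offset that verify() returns is a write the storage acknowledged but didn't actually keep.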
