Drive tester - brad's life - LiveJournal
Brad Fitzpatrick


Drive tester [Feb. 17th, 2005|03:47 pm]
After the big Internap power failure recently, we no longer trust any storage product to work as advertised.

I wrote a program to test disks/RAIDs and Matthew's been running it, finding out that, indeed, disks and RAIDs lie.

The test works like this:

-- Matthew goes to storage vendor with his laptop and crossover-connects his laptop and the server to be tested.

-- Matthew runs the server half of my disk tester on his laptop.

-- Matthew then runs the spewer client on a raw(8)-ified disk partition. The client picks a random 16kB-aligned offset on the partition and a random 32-bit number, which it writes in hex (%08x) over the 16kB range starting at that offset. It reports to the spewserver both BEFORE and AFTER the disk write.

-- the server notes what the client said it was about to do and what it reported doing.

-- let it run for a while...

-- Pull the power...

-- the spewserver (on the laptop) notices the client hasn't sent anything in 3 seconds and quits, writing out a map of what 32-bit number pattern should be at each sector.

-- power the server under test back on

-- copy the map file from the laptop (spewserver) to the server, run spewclient in verify mode. It dumps a histogram of errors per seconds-before-powerloss:

Histogram of seconds before end:
3 31
4 7
5 1
65 1

Well, the 3 seconds is really because the "end" is considered time AFTER the 3 second timeout, so that's kinda a bug. That should read 0,1,2 seconds before, not 3,4,5. But see how there are 31 regions that are bogus at t=0, 7 at t=-1, and 1 at t=-2?
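To make the off-by-three concrete, here is a minimal Python sketch of how the buckets could be computed with the timeout subtracted. It is not the actual tester: the function name, its arguments, and the timestamps in the usage note are all hypothetical, and it assumes the server remembers when each bad region's write was acknowledged.

```python
from collections import Counter

TIMEOUT_SECONDS = 3  # idle timeout after which the server quits (from the post)


def histogram_seconds_before_end(bad_write_times, end_time, timeout=TIMEOUT_SECONDS):
    """Bucket bad regions by whole seconds between their write and power loss.

    bad_write_times: timestamps (seconds) at which each region that later
        failed verification was reported as written. Hypothetical input.
    end_time: timestamp at which the server gave up and wrote its map.
    """
    # The server only quits `timeout` seconds after the last client message,
    # so the estimated moment of power loss is end_time - timeout.
    power_loss = end_time - timeout
    counts = Counter()
    for t in bad_write_times:
        counts[max(0, int(power_loss - t))] += 1
    return dict(sorted(counts.items()))
```

With made-up timestamps chosen to reproduce the counts above, `histogram_seconds_before_end([96.5] * 31 + [95.5] * 7 + [94.5, 34.5], end_time=100)` returns `{0: 31, 1: 7, 2: 1, 62: 1}`: the same counts, shifted down by the 3 second timeout.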

That means something was lying, and we don't buy that hardware until we get it configured so it doesn't lie.
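For anyone who wants to play with the idea, here is a rough Python sketch of the write/report/verify cycle described in the steps above. It is not the real spewer code. The names (`SpewServer`, `spew_once`, `verify`) are invented for illustration, the "network" is a plain function call, and an in-memory buffer stands in for the raw partition. A real test has to write to the raw device so that no OS cache sits in between.

```python
import io
import random

BLOCK_SIZE = 16 * 1024  # 16kB regions, as in the post


def make_block(value):
    """Fill one 16kB region with the 32-bit value written in hex (%08x)."""
    pattern = b"%08x" % value  # 8 bytes, e.g. b"deadbeef"
    return pattern * (BLOCK_SIZE // len(pattern))


class SpewServer:
    """Stand-in for the laptop side: remembers what each region should hold."""

    def __init__(self):
        self.acked = {}  # block index -> value the client reported as written

    def report(self, phase, block, value):
        if phase == "before":
            # Assumption: until the "after" report arrives, the region may
            # hold either the old or the new pattern, so stop expecting one.
            self.acked.pop(block, None)
        else:
            self.acked[block] = value


def spew_once(dev, num_blocks, report, rng=random):
    """Pick a random region and value, reporting BEFORE and AFTER the write."""
    block = rng.randrange(num_blocks)
    value = rng.getrandbits(32)
    report("before", block, value)
    dev.seek(block * BLOCK_SIZE)
    dev.write(make_block(value))
    dev.flush()  # on real hardware this is exactly the step that may be a lie
    report("after", block, value)
    return block, value


def verify(dev, expected):
    """Return the regions whose contents do not match the acknowledged value."""
    bad = []
    for block, value in sorted(expected.items()):
        dev.seek(block * BLOCK_SIZE)
        if dev.read(BLOCK_SIZE) != make_block(value):
            bad.append(block)
    return bad
```

To try it, create `dev = io.BytesIO(bytes(8 * BLOCK_SIZE))` and a `SpewServer()`, call `spew_once(dev, 8, server.report)` a few dozen times, then run `verify(dev, server.acked)`. An empty list means every acknowledged write is really there. Any region it returns was acknowledged as written but holds something else, which is the lie the test is hunting for.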

From: brad
2005-02-19 02:12 am (UTC)
I don't have all the results back from Matthew yet.

So far he's had success taking out disks from behind the RAID, changing their SCSI settings with scsirastools, and putting them back behind the RAID. Why the RAID card doesn't do this itself is fucking beyond me.