Drive tester - brad's life - LiveJournal
Brad Fitzpatrick


Drive tester [Feb. 17th, 2005|03:47 pm]
Brad Fitzpatrick
After the big Internap power failure recently, we no longer trust any storage product to work as advertised.

I wrote a program to test disks/RAIDs and Matthew's been running it, finding out that, indeed, disks and RAIDs lie.

The test works like this:

-- Matthew goes to storage vendor with his laptop and crossover-connects his laptop and the server to be tested.

-- Matthew runs the server half of my disk tester on his laptop.

-- Matthew then runs the spewer client on a raw(8)-ified disk partition. The client picks a random 16kB-aligned offset on the partition and a random 32-bit number, which it writes in hex (%08x) over that 16kB range. It reports to the spewserver both BEFORE and AFTER each disk write.

-- the server notes what the client said it was about to do and what it reported doing.

-- let it run for a while....

-- Pull the power...

-- server notices client hasn't sent anything in 3 seconds, quits, writing out a map of what 32-bit number pattern should be at each sector.

-- power on server

-- copy the map file from the laptop (spewserver) to the server, then run spewclient in verify mode. it dumps a histogram of errors per seconds-before-powerloss:

Histogram of seconds before end:
3 31
4 7
5 1
65 1


Well, the 3 seconds is really because the "end" is considered time AFTER the 3 second timeout, so that's kinda a bug. That should read 0,1,2 seconds before, not 3,4,5. But see how there are 31 regions that are bogus at t=0, 7 at t=-1, and 1 at t=-2?
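A rough sketch of how that histogram could be computed with the timeout subtracted out, so the buckets read 0, 1, 2 rather than 3, 4, 5. This is not the actual tester: the `expected` dict, the `read_region` callback, and the function name are all made up for illustration, and the real map file format isn't described here.

```python
from collections import Counter

TIMEOUT = 3  # seconds of silence before the recording side gives up (from the post)


def error_histogram(expected, read_region, end_time, timeout=TIMEOUT):
    """Bucket bad regions by whole seconds between their last acked write and power loss.

    expected:    hypothetical dict of offset -> (value, ack_time)
    read_region: hypothetical callable, offset -> value actually found on disk
    end_time:    when the recorder quit, i.e. `timeout` seconds after the last message
    """
    # Approximate the moment of power loss as the time of the last message received.
    power_loss = end_time - timeout
    hist = Counter()
    for offset, (value, ack_time) in expected.items():
        if read_region(offset) != value:
            hist[int(power_loss - ack_time)] += 1
    return dict(sorted(hist.items()))
```

Anything that shows up in the histogram at all is a write the hardware acknowledged and then lost; the bucket just says how long before the power cut it was acknowledged.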

That means something was lying, and we don't buy that hardware until we get it configured so it doesn't lie.
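
For anyone who wants to picture the client side, here's a minimal sketch of one write cycle as described above. It is an illustration under assumptions, not the real spewer: it writes through an ordinary file descriptor with fsync instead of a raw(8) device, and `report` is a hypothetical callback standing in for the network message to the spewserver. The function names are invented too.

```python
import os
import random

BLOCK = 16 * 1024  # 16kB regions, as in the post


def make_block(value):
    """Fill one 16kB region with a 32-bit value written as %08x, over and over."""
    return (b"%08x" % value) * (BLOCK // 8)


def pick_write(device_size, rng):
    """Pick a random 16kB-aligned offset and a random 32-bit value."""
    n_blocks = device_size // BLOCK
    offset = rng.randrange(n_blocks) * BLOCK
    value = rng.getrandbits(32)
    return offset, value


def spew_once(fd, device_size, rng, report):
    """One cycle: announce the write, do it, then confirm it."""
    offset, value = pick_write(device_size, rng)
    report("before", offset, value)  # recorder learns what is about to be written
    os.pwrite(fd, make_block(value), offset)
    os.fsync(fd)  # stand-in for the raw device; assumption, not the original method
    report("after", offset, value)  # recorder learns the write was acknowledged
```

Reporting on both sides of the write is what makes the check fair: a region only counts as an error if the "after" message went out and the data still isn't there.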

Comments:
[User Picture]From: jwz
2005-02-17 11:51 pm (UTC)
I didn't follow what's running where at all. Maybe a drawing would help?
(Reply) (Thread)
[User Picture]From: brad
2005-02-17 11:57 pm (UTC)
[Laptop w/ spewserver]  <---------  [ Server w/ disks and spew client ]


I guess there's two "servers" there, which is confusing.

The point is you spew random I/O at the expensive server w/ the expensive disks, which reports it over the network to the laptop running the spewserver. Then you pull the power cable on the expensive server, reboot it, and compare its disk contents with that which the laptop recorded.
(Reply) (Parent) (Thread)
[User Picture]From: brentdax
2005-02-17 11:59 pm (UTC)
From what I understand:
  1. Connect hard drive/storage system/blah to computer A.
  2. Connect computer A to computer B.
  3. Make computer B tell computer A what to write where on the drive/system/blah.
  4. At some point, pull the plug on computer A.
  5. Compare what should have been written (which is recorded on computer B) to what actually was written (on computer A).
  6. ...
  7. Profit!
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-02-18 12:01 am (UTC)
6 = angsty teens put bad poetry on disks
(Reply) (Parent) (Thread)
[User Picture]From: ericjay
2005-02-18 12:35 am (UTC)
Wow, funniest comment reply all day.
(Reply) (Parent) (Thread)
From: evan
2005-02-18 01:31 am (UTC)
(Reply) (Parent) (Thread)
From: alpha
2005-02-18 01:48 am (UTC)
Ahahaha. Yeah it is.
(Reply) (Parent) (Thread)
(Deleted comment)
[User Picture]From: brad
2005-02-17 11:58 pm (UTC)
You'd think.

I'm just as flabbergasted.
(Reply) (Parent) (Thread)
[User Picture]From: spottman
2005-02-18 12:05 am (UTC)
Can I get a copy of this little script? My agency is looking at buying large arrays for PII data. This would be a GREAT test app.

If you save my agency money.. Your fed taxes might go down $0.000000000000000001 next year. :)
(Reply) (Thread)
[User Picture]From: brad
2005-02-18 12:11 am (UTC)
It's not much longer than my description of it, but email me and I'll try and polish it up a tad for release. It doesn't set up the raw mapping automatically right now, but it could/should, to be more user-friendly.
(Reply) (Parent) (Thread)
[User Picture]From: supersat
2005-02-18 02:08 am (UTC)
What was the vendor's response?
(Reply) (Thread)
[User Picture]From: 7leaguebootdisk
2005-02-18 04:58 am (UTC)

this came up on the linux kernel list some time back

It seems that write-through is not really implemented on quite a lot of disks. There was a guy who wrote a program that would write a 1 to sector zero and a 1 to sector 1, then a 2 to sector zero and a 2 to sector 2, and so on, and then pull the plug at some point. Guess what sector zero has in it? A zero. NONE of the updates got written, because they were replaced before they got stale enough to get written.

I am not able to find the mention though.
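
Going only from the description above, the test might look roughly like this. It's a sketch, not the original program: the 512-byte sector size, the 4-byte big-endian counter, and the function names are all assumptions, and it works on any seekable file object rather than a real disk.

```python
SECTOR = 512  # assumed sector size


def run_writes(f, rounds):
    """Round n writes the counter n to sector 0 and then to sector n."""
    for n in range(1, rounds + 1):
        for sector in (0, n):
            f.seek(sector * SECTOR)
            f.write(n.to_bytes(4, "big").ljust(SECTOR, b"\0"))


def read_counter(f, sector):
    """Read the counter stored at the start of a sector (0 if nothing is there)."""
    f.seek(sector * SECTOR)
    return int.from_bytes(f.read(4), "big")


def check(f, max_sector):
    """Return (value in sector 0, highest sector n that really holds n)."""
    head = read_counter(f, 0)
    highest = max(
        (n for n in range(1, max_sector + 1) if read_counter(f, n) == n),
        default=0,
    )
    return head, highest
```

If `highest` comes back larger than `head`, later writes made it to the platter while the repeatedly rewritten sector zero never did.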
(Reply) (Thread)
[User Picture]From: brad
2005-02-18 05:54 am (UTC)

Re: this came up on the linux kernel list some time back

I'm not writing to the block device layer with the IO scheduler, though. I made a raw(8) device from a block device so I could bypass all that and actually get stuff to disk.
(Reply) (Parent) (Thread)
[User Picture]From: caladri
2005-02-18 07:50 am (UTC)

Re: this came up on the linux kernel list some time back

O_DIRECT
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-02-18 08:02 am (UTC)

Re: this came up on the linux kernel list some time back

Well, uh, in theory. But it has a history of sucking... each filesystem implements it itself, and each used to have bugs related to O_DIRECT, one of which I found and reported recently (a deadlock in XFS).

I want to test the disks, not filesystem code. That's a separate issue.
(Reply) (Parent) (Thread)
[User Picture]From: 7leaguebootdisk
2005-02-18 04:23 pm (UTC)

Re: this came up on the linux kernel list some time back

That is just it, he did that, the problem is at a hardware level.
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-02-18 05:02 pm (UTC)

Re: this came up on the linux kernel list some time back

Ah. Cool, so I'm not alone. :-)
(Reply) (Parent) (Thread)
(Deleted comment)
[User Picture]From: brad
2005-02-18 05:55 am (UTC)

Re: does it happen with Netapp?

We don't have a spare to test it, though.
(Reply) (Parent) (Thread)
[User Picture]From: moonwick
2005-02-18 06:56 am (UTC)
Brad, you're an evil genius. Cool idea.
(Reply) (Thread)
[User Picture]From: caladri
2005-02-18 07:48 am (UTC)
I decided not to overengineer tooo much for something slightly kinda sorta similar I'm doing at work. Using 8K "pages" it writes incrementing 32-bit integers starting at page->offset(0) and going to the end of the page. It's fairly easy to check for inconsistencies, and it also lets us check for something weird we've seen once or twice with a certain bit of software, where the data is shifted to the right by 4 bytes :/
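
A small sketch of that kind of page pattern and check, under assumptions: the starting value is passed in as a parameter, words are little-endian, and the names are invented. It is not the code from work, just the idea.

```python
import struct

PAGE = 8 * 1024     # 8K pages
WORDS = PAGE // 4   # number of 32-bit integers per page
FMT = "<%dI" % WORDS  # little-endian unsigned 32-bit words (assumed byte order)


def make_page(start):
    """Fill a page with incrementing 32-bit integers beginning at `start`."""
    values = [(start + i) & 0xFFFFFFFF for i in range(WORDS)]
    return struct.pack(FMT, *values)


def classify(page, start):
    """Return 'ok', 'shifted' (data moved right by 4 bytes), or 'corrupt'."""
    expected = struct.unpack(FMT, make_page(start))
    words = struct.unpack(FMT, page)
    if words == expected:
        return "ok"
    # Shifted right by one word: word i now holds what word i-1 should hold.
    if words[1:] == expected[:-1]:
        return "shifted"
    return "corrupt"
```

Because every word is predictable from its position, a shifted page is easy to tell apart from plain garbage.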
(Reply) (Thread)
[User Picture]From: taral
2005-02-19 01:14 am (UTC)
Mwahahaha. Do they have battery backup?
(Reply) (Thread)
[User Picture]From: brad
2005-02-19 02:12 am (UTC)
I don't have all the results back from Matthew yet.

So far he's had success taking out disks from behind the RAID, changing their SCSI settings with scsirastools, and putting them back behind the RAID. Why the RAID card doesn't do this itself is fucking beyond me.
(Reply) (Parent) (Thread)