brad's life
Brad Fitzpatrick

[May. 9th, 2005|08:49 pm]

Dear Slashdot comment flood: I didn't write the summary in the Slashdot story. The submitter did. I know the disks themselves don't handle an fsync().

I know fsync() only tells the operating system to flush. This script tests whether fsync works end-to-end. A database in userspace can only fsync... it can't send special IDE or SCSI "flush your buffers" commands to the disks. So that's what I care about and what I want to test: that a database can be Durable.
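
To make that concrete, here is about all a userspace program can do to ask for durability. This is a minimal Python sketch, not the checker script itself, and `durable_write` is a made-up name for illustration:

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after the OS says it has been flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain the userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to push its buffers to the device


# Small demo: write a record into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "record.bin")
    durable_write(demo_path, b"committed")
    with open(demo_path, "rb") as f:
        print(f.read())
```

Whether the bytes really are on the platters when `os.fsync` returns depends on everything below the system call, which is exactly what the test is meant to find out.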

The problem is manufacturers shipping SCSI disks with a dangerous option (hardware write-caching) enabled by default. It makes sense for consumer ATA stuff, but for SCSI disks that already have reliable TCQ, there's much less point. And any respectable raid card should just disable write-back caching on the disks if the raid card has its own nvram-backed cache, but LSI doesn't anymore (they used to, but stopped).

And I'm glad Linux is finally starting to tell the disks to flush on an fsync. But months ago I was stuck with databases that couldn't survive power outages, and I needed a way to test whether everything from the filesystem to the block device driver to the disks themselves was doing what the database expected when it did an fsync. That's no single component's job, so I needed to test everything together.

Remember my disk-checker program I wrote about before? I'd never released it because it was too hard to use, but now it's dead simple, so here it is: Edit: now

Run it and be amazed how much your disks/raid/OS lie. ("lie" = an fsync doesn't work)

It seems everything from PATA consumer disks to high-end server-class SCSI disks lie like crazy. Yes, that includes SATA there in the middle. I'll discuss fixing your storage components in a second.

In a nutshell, run it like this:

Tester machine (machine that won't crash):
$ -l

And then just let it chill. (the el is for listen). This program will listen (on port 5400 if no number follows -l) and will write one tiny file per host to /tmp/. It can be run as any user.

Machine being tested (machine you're going to pull the power cable on)
$ -s TESTERMACHINE create test_file 500

That creates a 500 MB file named "test_file" and reports everything it's about to do, and everything it has done, to TESTERMACHINE (which can be an IP address or a hostname).

Now, pull the power cable on the machine being tested. Don't turn it off nicely. Don't just control-C the program. Wait a couple seconds and plug your testee machine back in and reboot it. When it's back up, do:

$ -s TESTERMACHINE verify test_file

If the server process is still running, the machine you just killed will connect to the server and fetch the information about what's supposed to be where. The client will then verify it and produce an error report.
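
The general shape of such a check can be sketched in a few lines of Python. This is an illustration under assumptions, not the script's actual code: the block size, the CRC32 checksum, the function names, and the in-memory `reported` dict (standing in for the messages sent to the tester machine) are all invented here.

```python
import os
import tempfile
import zlib

BLOCK = 4096  # assumed block size for this sketch


def write_and_report(path, n_blocks, n_writes, report):
    """Overwrite random blocks; report each one only after fsync returns."""
    with open(path, "wb") as f:
        f.truncate(n_blocks * BLOCK)  # pre-size the test file
    with open(path, "r+b") as f:
        for _ in range(n_writes):
            idx = int.from_bytes(os.urandom(4), "big") % n_blocks
            data = os.urandom(BLOCK)
            f.seek(idx * BLOCK)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            # Only now claim the block is durable (stand-in for the network message).
            report(idx, zlib.crc32(data))


def verify(path, reported):
    """Count blocks whose on-disk contents differ from the last report."""
    errors = 0
    with open(path, "rb") as f:
        for idx, crc in reported.items():
            f.seek(idx * BLOCK)
            if zlib.crc32(f.read(BLOCK)) != crc:
                errors += 1
    return errors


# Demo without a crash: every reported block should verify cleanly.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "test_file")
    reported = {}
    write_and_report(demo_path, 16, 50, reported.__setitem__)
    print("Total errors:", verify(demo_path, reported))
```

The ordering is the point: a block is reported only after fsync has returned, so any reported block that is missing after the power pull means something in the stack acknowledged a flush it had not done.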

What you should see is:

Total errors: 0.

But you probably won't. You'll probably see an error count and a histogram of errors per seconds-before-crash. Most RAID cards lie (especially LSI ones), some OSes lie (rare), and most disks lie (doesn't matter how expensive or cheap they are). They lie because their competitors do and they figure it's more important to look competitive because the magazines only print speed numbers, not reliability stats. They must figure people who care about their disks working know how to test/fix their disks.
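
For a feel of what an errors-per-seconds-before-crash histogram is, here is a small Python sketch. It assumes the tester keeps a timestamp for every reported write and treats the last report it received as the crash time; both are assumptions made for the example, and the function name is invented.

```python
from collections import Counter


def errors_by_seconds_before_crash(error_times, crash_time):
    """Bucket the timestamps of lost writes by whole seconds before the crash."""
    hist = Counter()
    for t in error_times:
        hist[int(crash_time - t)] += 1
    return dict(sorted(hist.items()))


# Hypothetical timestamps (in seconds) of writes that failed verification.
print(errors_by_seconds_before_crash([99.5, 99.1, 98.2, 95.0], crash_time=100.0))
```

Errors clustered in the last second or two point to a write cache that was still holding data when the power went; errors spread much further back suggest something worse.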

Ways to maybe fix your disk:

hdparm -W 0 /dev/hda -- worked on a crap office PATA disk (and it failed otherwise)
scsirastools -- need this on lots of SCSI disks. you'll probably have to remove your SCSI disks from your RAID card and fix the disks directly, since RAID cards very often won't disable it for you

Anybody have anything else to add?


Page 1 of 2
<<[1] [2] >>
From: wetzel
2005-05-10 04:07 am (UTC)
will this bit of script work on windows (w/ activeperl), or will it freak on me like it's retarded?

or is the only way to find out to test it?

From: brad
2005-05-10 05:20 am (UTC)
Should work. I use nothing fancy. All pure, base Perl.

From: electromage
2005-05-10 04:14 am (UTC)
There's far too little emphasis on reliability in benchmarks and other comparisons.

From: claystorm
2005-05-10 04:56 am (UTC)

Linux only?

Is this Linux only, or can it be run in a Windows environment with a perl shell running in Windows?

From: brad
2005-05-10 05:20 am (UTC)

Re: Linux only?

Should work on Windows unmodified.

(Deleted comment)

From: brad
2005-05-10 05:39 am (UTC)
Go for it.

From: gholam
2005-05-10 05:48 am (UTC)
Isn't this... caching? By pulling the plug, you kill whatever data is pending write in drive's volatile RAM cache, and that's what they have UPSes and stupidly expensive battery-backup-capable RAID controllers for.

From: brad
2005-05-10 05:58 am (UTC)
Re-read my post. I'm testing whether the entire storage stack respects the fsync() system call. All the way from the OS to the drivers to the raid array to the disks themselves.

The fsync() system call says: "Stop caching, it's very important that everything I've given you now must be on disk, and don't return to me with an answer until it has."

So this program tests that your fsync() works as advertised and that no part of the storage stack is faking the fsync. (It's usually the disks themselves, against spec and unknown to the operating system, which thinks the disks are behaving.)

Otherwise caching's just fine and it's done all over. My complaint is when it's done when you tell it not to.

From: ydna
2005-05-10 07:34 am (UTC)
Sweet, Brad. Perfect timing. I've got an ATA-over-Ethernet evaluation unit from Coraid on its way. I'm planning several tests on this equipment for a demo and presentation up here in June (background: see my rant and LUG discussion). It'll be fun to add this test to the mix (and pull the plug at different places in the connection to see which part does most of the lying). Thankee much.

From: brad
2005-05-10 07:52 am (UTC)
I still don't really get AoE. Who's it supposed to be for?

BTW, you mentioned in one of those posts capturing ethernet frames in userspace to make an AoE server. Look at "tap". In the kernel source, read:


And docs for CONFIG_FILTER=y and CONFIG_PACKET=y. Between the three, you could start to write an AoE server.

From: boggyb
2005-05-10 08:50 am (UTC)
Some Windows-related details of disk caching that I've found out about (and may be useful to those of you using Windows).

Under Windows NT (that's 2k, XP, 2k3 as well - remember those are NT 5.x), you can disable disk caching by going to the properties for the hard disk and then going to the Settings/Properties/Policies tab (it's named differently under different versions but does the same thing). By default write cache is enabled except on disks containing the Active Directory. It is also disabled by default for removable disks (USB, memory cards (but not all!), iPods, etc.) and apparently anything it thinks is SCSI (this includes some IDE/SATA controllers). I think Windows tells the disk itself to explicitly disable the disk cache as well.

When using CreateFile(), you can set the FILE_FLAG_WRITE_THROUGH flag, which tells Windows to write data through to the disk instead of leaving it in the cache for a lazy flush, or you can set FILE_FLAG_NO_BUFFERING, which bypasses all Windows caches but has some restrictions (buffers have to be sector-aligned in memory, and reads and writes have to be in multiples of the sector size). These are set in the dwFlagsAndAttributes parameter.
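
The size restriction boils down to rounding every transfer up to a whole number of sectors. A tiny Python sketch of that arithmetic, with the caveat that the 512-byte default is only an assumption (real code should ask the volume for its sector size) and `align_up` is a name made up for the example:

```python
def align_up(n: int, sector: int = 512) -> int:
    """Round n up to the next multiple of the sector size."""
    if sector <= 0:
        raise ValueError("sector size must be positive")
    return -(-n // sector) * sector  # ceiling division, then scale back up


# A 5000-byte request on a 4096-byte-sector volume needs an 8192-byte transfer.
print(align_up(5000, 4096))
```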

From: mendel
2005-05-10 02:26 pm (UTC)
5.8.1's Getopt::Long ignores this, but 5.6.1's chokes on it.
---     2005-05-09 20:28:18.000000000 -0400
+++      2005-05-10 10:23:39.647304453 -0400
@@ -22,3 +22,3 @@
 usage() unless GetOptions('server=s' => \$server,
-                          'listen:5400' => \$listen);
+                          'listen:i' => \$listen);
 usage() unless $server || $listen;

From: mendel
2005-05-10 02:37 pm (UTC)
Hah, no, that's not enough! You and your little ands and ors.
---     2005-05-09 20:28:18.000000000 -0400
+++      2005-05-10 10:33:01.437798948 -0400
@@ -21,10 +21,11 @@
 my $listen;
 usage() unless GetOptions('server=s' => \$server,
-                          'listen:5400' => \$listen);
-usage() unless $server || $listen;
-usage() if     $server && $listen;
+                          'listen:i' => \$listen);
+usage() unless $server || defined $listen;
+usage() if     $server && defined $listen;

-listen_mode($listen) if $listen;
+my $port = $listen ? $listen : 5400;
+listen_mode($port) if defined $listen;
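
For comparison, the same "flag with an optional port, defaulting to 5400" logic in Python's argparse. This is only an analogy to the patch above, not part of the script; `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv):
    """Accept exactly one of -s HOST or -l [PORT]; PORT defaults to 5400."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--server")
    # nargs="?" makes the value optional; const is used when -l is given bare.
    parser.add_argument("-l", "--listen", nargs="?", type=int,
                        const=5400, default=None)
    args = parser.parse_args(argv)
    if (args.server is None) == (args.listen is None):
        parser.error("give exactly one of --server or --listen")
    return args


print(parse_args(["-l"]).listen)
print(parse_args(["-l", "6000"]).listen)
```

Testing against `None` plays the role of `defined $listen` in the patch: a bare `-l` still counts as "listen mode" even though no port number was typed.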


From: scosol
2005-05-11 01:48 pm (UTC)
does it work on windows????? :P

From: pne
2005-05-11 04:16 pm (UTC)
It should, as several previous comments mentioned.

You'd need a Perl interpreter, of course, such as ActivePerl, IndigoPerl, or Cygwin's perl.

From: orbadelic
2005-05-11 08:25 pm (UTC)


so the scsirastools are used to disable write caching on the drives themselves? Going to run this later today with a freshly minted dell running freebsd 5.4, I wonder if there are any similar tools for it.

From: vyhuhol
2005-05-13 08:03 am (UTC)
Doesn't this hdparm -W option cause a severe performance loss?

From: (Anonymous)
2005-05-13 08:04 am (UTC)

We need a "name and shame" list or wiki

This is very interesting, but we can only check the disks that we have physical access to.

What we need is a public "name and shame" list of disk models, so folks who *require* reliable data integrity can make informed choices.

Anyone interested in setting up an online database or Wiki?


From: (Anonymous)
2005-05-13 03:20 pm (UTC)

Re: We need a "name and shame" list or wiki


I'll talk to my boss and see if he'll allow me to create a small space on our webserver for a "name and shame" type thing that you're talking about (if no one else has done it, that is). It might not be a Wiki (just a static webpage at first) but it might be a good excuse for me to work on getting the company Wiki up and running 80).

S. Garcia
SLM Industries
~NOSPAM~ steven &DOT& garcia %AT% slmindustries ~DOT~ com ^SPAMISEVIL^

From: (Anonymous)
2005-05-13 08:15 am (UTC)

Nothing strange here, it's your assumptions that are wrong

It is strange that after finding out that fsync() never functions the way you expect, you didn't doubt your assumptions or your code!
Your assumption is wrong! fsync() is not required to flush the hard-disk cache. Its function is just to flush all buffers inside the OS and the device driver to the hard disk, and it does that perfectly.
As you see, these are two different things: flushing to the device and flushing to the disk. fsync() flushes to the device but you expect it to flush to the disk (which it does not). Flush to the disk usually can only be done in a system level program (like a device driver) and not a user program.

From: brad
2005-05-13 02:24 pm (UTC)

Re: Nothing strange here, it's your assumptions that are wrong

Yes, but that's all a userspace program like a database can do. And while I know that fsync() says it can't promise it makes it to disk if write-caching is enabled, this tool is a great way to see if your disk's write-caching is indeed on.

The old version of this script used the raw(8) interface to bypass all filesystems and kernel buffers, but it produced the same results as the fsync version, so I made it just use fsync so it was easier to use. (don't need a spare block device on the disk(s) being tested handy....)

From: error10
2005-05-13 08:29 am (UTC)

hdparm -W 0 /dev/hda

I needed this two months ago! I suffered massive filesystem corruption due to exactly this problem. Well, this and the fact that ext3 still sucks. And why doesn't Red Hat do reiserfs anyway?