brad's life
Brad Fitzpatrick


Asynchronous Block Device [Oct. 29th, 2003|10:09 pm]
Brad Fitzpatrick
Another fun layered block device, one I'm not easily finding existing examples of, is an asynchronous block device with ioctls that a userspace daemon could use to force a global flush and synchronous behavior when a UPS says it's switching to battery power.

Sure, everybody mounts their filesystems async, but that mount option only affects parts of the filesystem's behavior, and doesn't affect flushes that end up calling sync.

Depending on the application, it might be beneficial to mount the filesystem sync on top of the async block device.

Here's how I see abd working:

-- you configure a maximum amount of unpageable memory to use for the driver's caches / data structures

-- syncs are ignored. always lie and say it's good.

-- writes are queued in memory, and an in-memory mapping from that sector to its new value is stored.

-- another thread does the writes asynchronously, then removes each completed write from the queue and the mapping.

-- reads check the mapping first, falling back to passing the request to the lower-layer block device

-- means of checking the number of uncommitted writes via ioctls/sysfs

-- means of using ioctl/sysfs to tell the block device to stop accepting new async writes and do all writes synchronously, accepting them at a slow rate or not at all (configurable), while the driver works on flushing all queued writes

-- driver could issue tons of writes at a time, so the lower-level block device could do nice write-out scheduling. (Note: learn more about TCQ, aio)
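
The list above can be modelled in userspace as a rough sketch. This is not kernel code and not an existing driver: the class name `AsyncBlockDevice`, its methods, and the use of a plain dict as the lower-layer device are all made up for illustration, and `max_pending` stands in for the unpageable-memory limit.

```python
import threading
from collections import OrderedDict


class AsyncBlockDevice:
    """Userspace model of the abd idea; `lower` is any dict-like sector store."""

    def __init__(self, lower, max_pending=1024):
        self.lower = lower              # stands in for the lower-layer block device
        self.max_pending = max_pending  # stands in for the unpageable-memory cap
        self.pending = OrderedDict()    # sector -> newest unwritten data, in queue order
        self.force_sync = False         # set once the UPS reports battery power
        self.lock = threading.Lock()    # guards pending, lower and force_sync

    def write(self, sector, data):
        with self.lock:
            has_room = sector in self.pending or len(self.pending) < self.max_pending
            if self.force_sync or not has_room:
                self.pending.pop(sector, None)  # drop any stale queued value
                self.lower[sector] = data       # write through before returning
            else:
                self.pending[sector] = data     # queue it and return immediately

    def read(self, sector):
        with self.lock:
            if sector in self.pending:          # check the mapping first
                return self.pending[sector]
            return self.lower.get(sector)       # fall back to the lower layer

    def sync(self):
        return True                             # syncs are ignored: always claim success

    def uncommitted(self):
        with self.lock:
            return len(self.pending)            # what an ioctl/sysfs counter would report

    def flush_one(self):
        """One step of the background writer; a real one would loop in its own thread."""
        with self.lock:
            if not self.pending:
                return False
            sector, data = self.pending.popitem(last=False)  # oldest queued write
            self.lower[sector] = data
            return True

    def on_battery(self):
        """What a UPS daemon would trigger: go synchronous, then drain the queue."""
        with self.lock:
            self.force_sync = True
        while self.flush_one():
            pass
```

A single lock keeps the model simple at the cost of serializing everything; a real driver would want to batch many writes to the lower device at once, as the last list item suggests.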

Anybody know of any async block device driver tied into UPS notification daemons?

In my ideal world I'd layer block devices as so:

Application (database)
async block device
memcache block device
raw block device (locally attached raid driver)

Instant writes, really fast reads (with big cache), reliable storage. No super-pricey "appliances".

From: joggingguy
2003-10-30 06:31 am (UTC)
When we would set up Informix databases, we would get a boost by using raw io, and would get another 10 percent boost by using AIO as well. Using AIO eliminates a context switch in kernel space.
From: taral
2003-10-30 09:31 am (UTC)
AIO. That's _exactly_ what it's there for. Unfortunately, I don't think Linux has kernel interfaces to support it properly, so glibc emulates it with a thread. Might want to check 2.6 though.

Also, I don't think the kernel caches block device reads/writes. So if you want that, try mapping the block device into memory.
From: brad
2003-10-30 10:17 am (UTC)
But that'd require changing the application.

And if the application felt it needed to sync for certain things (flushing a journal, index files, etc), it'd get its way.

I want to flat out lie to the app, just as if I'd mounted its data on a fibre channel battery-backed RAM disk.
From: taral
2003-10-31 09:48 am (UTC)
Oh. That's not available in any OS I know of. :P
From: brad
2003-10-31 10:35 am (UTC)
Yeah, I tried searching.

It'd be safe, though, if the UPS was properly configured.

Perhaps a way to make it even safer would be to default to synchronous behavior all the time, and require notifications from userspace that the UPS is functioning, and only then do async for 5 minutes or so, waiting for the next UPS notification in a couple minutes.

That way dumb people couldn't do it for extra speed and totally fuck themselves.
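
The default-to-sync idea amounts to a lease that userspace has to keep renewing. A minimal sketch, assuming a made-up `UpsLease` class and a 300-second window taken from the "5 minutes or so" above; nothing here corresponds to a real UPS daemon's API:

```python
import time


class UpsLease:
    """Async mode is allowed only while a recent UPS 'all good' notification is on file."""

    def __init__(self, lease_seconds=300.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock        # injectable so the lease can be tested without waiting
        self.expires_at = None    # no notification yet: stay synchronous

    def ups_ok(self):
        """Called by the (hypothetical) UPS daemon each time it confirms mains power."""
        self.expires_at = self.clock() + self.lease_seconds

    def ups_on_battery(self):
        """Called when the UPS switches to battery: revoke the lease at once."""
        self.expires_at = None

    def async_allowed(self):
        return self.expires_at is not None and self.clock() < self.expires_at
```

The driver would check `async_allowed()` on every write and fall back to synchronous behavior the moment the lease lapses, so a dead or misconfigured daemon fails safe.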
From: smartjournal
2003-10-30 01:55 pm (UTC)
and i have no idea what any of this means...
From: jeffr
2003-11-01 10:28 pm (UTC)
Well, this is interesting, but I think it's at the wrong layer.

You really want the file system buffer cache to do a good job of keeping the right things in memory. It should only write out to disk when it is really necessary anyway. Users only get synchronous writes when they call fsync() or update metadata in a non softupdate, non journaled, sync filesystem.

If you implemented a block device cache as well you'd be competing for pages with the buffer cache, which presumably has a better idea of what the overall system io requirements are.

As far as issuing lots of IO at once for efficiency, most operating systems implement some IO clustering. When combined with IO sorting, this gets you as close as you could get with block device caching.

If your application has special buffering requirements you should use O_DIRECT and manage the memory in user-space. This is what big databases do.
From: brad
2003-11-02 02:11 am (UTC)
The block device cache would only be for unwritten blocks. It's not a general-purpose cache. Plus, I see this being used on machines with tons of memory to begin with, where setting aside an extra pool of memory is well worth it, especially when the alternative is an expensive appliance which already does this.

What I want to take advantage of is knowing reliable power is available. Even if the OS already does IO clustering and has a really efficient elevator, if an application does an fsync(), doesn't that mean it really does block until it gets written? That's what I want to replace, and without modifying all the applications. If not, though, maybe this all makes less sense. But the fact that other appliances do it makes it seem worthwhile.

I want to essentially change the semantics of fsync from "get state on disk and wait" to "ensure that state will get to disk eventually".

The whole point is not changing applications. It's letting the sysadmin tag block devices as async when it's safe to do so, without buying expensive hardware.

But maybe I can accomplish the same thing with a filesystem that supports journals on an external device (XFS/ext3 using a PCI NVRAM card or such). I'm going to get some of those umem.com cards and play around.
From: jeffr
2003-11-02 03:35 am (UTC)
I see.. So you aren't worried about software crashes losing the data, it's power failures only? If this is the case, you could just instrument fsync(). In FreeBSD's vfs_syscalls.c in the ^fsync() function, you would change:
error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, td);
to:
error = VOP_FSYNC(vp, fp->f_cred, MNT_NOWAIT, td);

And then change the syncer and buf daemon such that they only actually write out data if memory is needed or the power fails, or simply make the interval between writes much longer. In FreeBSD a buffer will be held for 30 seconds before we write it, unless we're running out of memory. Then we LRU them and free them until we have enough memory.
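
As a toy illustration of that policy (not FreeBSD's actual syncer or buf daemon), dirty buffers can be held until they reach a maximum age, evicted in LRU order when the buffer limit is hit, and all written out on a power failure. The class `DelayedWriteCache`, its parameters, and the choice to restart a buffer's timer on rewrite are assumptions made for the sketch:

```python
from collections import OrderedDict


class DelayedWriteCache:
    """Toy model: hold dirty buffers for max_age seconds unless memory runs short."""

    def __init__(self, disk, max_buffers=4, max_age=30.0):
        self.disk = disk                # dict standing in for the real device
        self.max_buffers = max_buffers  # stands in for "running out of memory"
        self.max_age = max_age          # 30 seconds in the FreeBSD description
        self.dirty = OrderedDict()      # block -> (data, time dirtied), oldest first

    def write(self, block, data, now):
        self.dirty.pop(block, None)
        self.dirty[block] = (data, now)  # rewriting restarts the timer (a simplification)
        while len(self.dirty) > self.max_buffers:
            self._flush_oldest()         # memory pressure: write out the LRU buffer

    def tick(self, now, power_failed=False):
        """One periodic syncer pass; on power failure everything is written out."""
        while self.dirty:
            _, (_, dirtied_at) = next(iter(self.dirty.items()))
            if not power_failed and now - dirtied_at < self.max_age:
                break                    # oldest buffer is still young, so all are
            self._flush_oldest()

    def _flush_oldest(self):
        block, (data, _) = self.dirty.popitem(last=False)
        self.disk[block] = data
```

Raising `max_age` is the "make the interval between writes much longer" option; the `power_failed` flag is where a UPS notification would plug in.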

I imagine there would be analogous changes in Linux. The point here is that if you're going to cache in memory on a machine it's most efficient to do it in one place. Otherwise you're going to have the page cache duplicating blocks that you have in your block device cache. You'll be wasting memory.

The VM is best at doing its job when it has access to all of memory.

Using a NVRAM card is a pretty popular solution. This makes a lot of sense especially with journaling filesystems. I've worked on such a system with full data journaling before. The benefit here is that it will sustain software crashes as well as power failures. I'd recommend this if you'd like to delay your writes even further.
From: jeffr
2003-11-02 03:36 am (UTC)
Er, second VOP_FSYNC line should read MNT_NOWAIT :-)