October 29th, 2003

belize

storage, layer block device for memcached

I've been busy learning all about SANs, solid state disks (battery backed SDRAM vs. Flash), Fibre Channel and topologies, global filesystems, the Linux block device layer, etc...

I need to go on a spending spree and play with some of this cool hardware. Some of the cool stuff costs around $65k+, though.

I bought the new book Linux Kernel Development by Robert Love.

I've been searching for a reason to try my hand at kernel development (yeah, laugh) and have been reading the LKML, lwn.net kernel news, and various bits of the source for ages in preparation.

I'm thinking of writing a stacked block device driver that puts a multi-memcached-based caching layer on top of another block device. memcached would be extended to support a faster protocol, so memcached wouldn't have to parse requests or construct its internal hash values itself. I'd make a parameter of the block device be a unique ID, which would be sent along to the memcached servers, so multiple machines with independent block devices could share some subset of the same set of memcached servers.
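To make the "faster protocol" idea concrete, here is a rough userspace sketch of what a fixed-width request could look like. Everything in it is an assumption for illustration: the 20-byte layout, the field widths, the use of CRC32 as the precomputed hash, and the name make_get_request are made up, and none of this is part of memcached's real protocol.

```python
import struct
import zlib

# Hypothetical layout: 4-byte precomputed hash, 8-byte device ID, 8-byte sector.
# Network byte order, fixed width, so the server has nothing to parse or hash.
KEY_FORMAT = "!QQ"
REQUEST_FORMAT = "!IQQ"


def make_get_request(device_id, sector):
    """Build a 20-byte lookup request for one sector of one block device."""
    key = struct.pack(KEY_FORMAT, device_id, sector)
    key_hash = zlib.crc32(key)  # computed by the client, sent along with the key
    return struct.pack(REQUEST_FORMAT, key_hash, device_id, sector)


if __name__ == "__main__":
    # Two machines with different device IDs never collide on the same sector.
    print(make_get_request(1, 4096).hex())
    print(make_get_request(2, 4096).hex())
```

Because the device ID is part of the key, two independent block devices can point at the same servers without ever reading each other's sectors.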

The kernel driver would only give the remote memcached servers a few milliseconds to respond before considering them loaded or down and falling back to the lower block device.

The driver wouldn't do re-hashing, to avoid getting stale data after a machine disappears and reappears on the network with its cache intact. (I suspected we might run into that problem on LiveJournal with memcached, and we have a bit lately... I'm working on a solution which I'll write about later.)
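The timeout and no-re-hashing rules together describe a simple read path, sketched below in userspace Python rather than kernel C. The names read_sector, cache_get and lower_read are hypothetical, the 5 ms default is just a stand-in for "a few milliseconds", and CRC32-modulo is an assumed way of picking a server.

```python
import zlib


def read_sector(key, servers, cache_get, lower_read, timeout_s=0.005):
    """Try exactly one cache server for a key, then fall back to the disk.

    cache_get(server, key, timeout_s) is a hypothetical callable that returns
    the cached bytes, returns None on a miss, or raises OSError (which covers
    TimeoutError and ConnectionError) when the server is slow or down.
    lower_read(key) is a hypothetical callable that reads the lower device.
    """
    # Fixed mapping over the full configured list. A dead server is never
    # removed from the list, so keys never move to a different server.
    server = servers[zlib.crc32(key) % len(servers)]
    try:
        data = cache_get(server, key, timeout_s)
    except OSError:
        data = None  # loaded or down: do not re-hash to another server
    if data is None:
        data = lower_read(key)  # miss or failure: the lower device is authoritative
    return data
```

A key only ever lives on one server, so a server that drops off the network and later comes back cannot disagree with a copy that was re-hashed somewhere else in the meantime. This sketch does not deal with writes that happen while a server is unreachable.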

Anyway, I don't think it would be too difficult. I'd be able to leverage the md driver (multi-disk, for software RAID/etc) in the kernel for the block device stacking, and I'd be able to leverage the nbd driver (network block device) for learning how to do network I/O within kernel space.

Alternatively, I could probably find a way to do this all in userspace. I know I could in FreeBSD (Jeff was just writing about GEOM_DISK the other day), and I know there are filesystems implemented in userland on Linux, so a userland block device isn't unreasonable to assume existing. But kernel would be more fun, for learning.

And don't worry--- it would be years until I put something like this into use on LiveJournal, if ever. Your data is safe. Don't fear my non-existent skills in kernel space.

(somewhat related: Brian Aker and I have been discussing extending MySQL's buffer/row cache system with memcached support... or making a new table handler completely. That might be an easier place to start, but block-device-level would be so much more cool.)

Asynchronous Block Device

Another fun layered block device which I'm not easily finding existing examples of is an asynchronous block device, with ioctls which a userspace daemon could use to force a global flush and synchronous behavior when a UPS says it's switching to battery power.

Sure, everybody mounts their filesystems async, but that mount option only affects parts of it, and doesn't affect flushes that end up calling sync.

Depending on the application, it might be beneficial to mount the filesystem sync on top of the async block device.

Here's how I see abd working:

-- you configure a maximum amount of unpageable memory to use for the driver's caches / data structures

-- syncs are ignored. always lie and say it's good.

-- writes are queued in memory, and an in-memory mapping from that sector to its new value is stored.

-- another thread does the writes asynchronously, then removes each write from the queue and mapping.

-- reads check the mapping first, falling back to passing the request to the lower-layer block device

-- means of checking the number of uncommitted writes via ioctls/sysfs

-- means of using ioctl/sysfs to tell the block device to stop accepting new async writes and do all writes sync, accepting them at a slow rate or not at all (configurable), while the driver works on flushing all queued writes

-- driver could issue tons of writes at a time, so the lower-level block device could do nice write-out scheduling. (Note: learn more about TCQ, aio)
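
The list above can be modelled in a few dozen lines of userspace Python. This is only a toy under stated assumptions, not a driver: the class and method names are made up, max_pending stands in for the unpageable-memory limit, a full queue is handled by writing straight through (the list doesn't say what should happen), and flush_one() is what the background thread would call in a loop.

```python
import threading
from collections import OrderedDict


class MemoryDisk:
    """Stand-in lower-level block device: a dict of sector -> bytes."""

    def __init__(self):
        self.blocks = {}

    def read(self, sector):
        return self.blocks.get(sector)

    def write(self, sector, data):
        self.blocks[sector] = data


class AsyncBlockDevice:
    """Toy model of the write-behind layer; one lock keeps it simple."""

    def __init__(self, lower, max_pending=1024):
        self.lower = lower
        self.max_pending = max_pending  # assumed cap on queued sectors
        self.pending = OrderedDict()    # queue and sector -> new value mapping in one
        self.sync_mode = False
        self.lock = threading.Lock()

    def sync(self):
        return True  # syncs are ignored: always claim success

    def write(self, sector, data):
        with self.lock:
            full = sector not in self.pending and len(self.pending) >= self.max_pending
            if self.sync_mode or full:
                # Drop any older queued value so it can't overwrite this one later.
                self.pending.pop(sector, None)
                self.lower.write(sector, data)
            else:
                self.pending[sector] = data

    def read(self, sector):
        with self.lock:
            if sector in self.pending:  # newest data may not be on disk yet
                return self.pending[sector]
            return self.lower.read(sector)

    def uncommitted(self):
        with self.lock:
            return len(self.pending)

    def flush_one(self):
        """Commit the oldest queued write; returns False when nothing is queued."""
        with self.lock:
            if not self.pending:
                return False
            sector, data = self.pending.popitem(last=False)
            self.lower.write(sector, data)
            return True

    def enter_sync_mode(self):
        """What a UPS daemon would trigger: stop queueing, then drain the queue."""
        with self.lock:
            self.sync_mode = True
        while self.flush_one():
            pass
```

Holding one lock across the lower-level write keeps the model easy to reason about but serializes everything. A real driver would want finer-grained locking and would hand many writes to the lower device at once.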


Anybody know of any async block device driver tied into UPS notification daemons?

In my ideal world I'd layer block devices like so:

Application (database)
async block device
memcache block device
raw block device (locally attached raid driver)

Instant writes, really fast reads (with big cache), reliable storage. No super-pricey "appliances".