After some pub grub and beer, Whitaker and I proceeded to hack on the one notable hurdle before FotoBilder goes live...
MogileFS
We have two behemoth machines coming in next week, each with sixteen 250GB drives. Rather than pay some vendor for some distributed/replicated filesystem, I figured it'd be pretty easy to do on our own.
Think memcached for files, but reliable. A way to scale storage out over tons of machines and disks. It tracks where files are on different machines and keeps enough replicas on different disks (making sure they're on separate hosts too). Hosts and devices can be marked down for maintenance, or auto-detected as temporarily down.
Whitaker's great addition to the scheme is classes of files. On FotoBilder, for instance, a lot of the files are transformations of other files (scaled down versions and thumbnails). They don't need to be replicated around as much, as they can always be recreated on the fly. It's just nice to have them around at least once to cut down CPU usage.
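To make the placement rule concrete, here's a rough Python sketch of picking devices for a new file: only devices that are up, never two on the same host, and a per-class copy count. Everything named here (`Device`, `pick_devices`, `CLASS_REPLICAS`, the "most free space first" preference, the copy counts) is made up for illustration and is not the real MogileFS schema or code.

```python
from collections import namedtuple

# Hypothetical device record: which host it lives on, whether it is up,
# and how much free space it has (units do not matter for the sketch).
Device = namedtuple("Device", ["dev_id", "host", "status", "free"])

# Hypothetical class table: originals get two copies, derived files one.
CLASS_REPLICAS = {"original": 2, "thumbnail": 1}


def pick_devices(devices, replica_count):
    """Pick up to replica_count alive devices, no two on the same host."""
    chosen = []
    used_hosts = set()
    # Assumed policy: try the emptiest devices first.
    for dev in sorted(devices, key=lambda d: d.free, reverse=True):
        if dev.status != "alive" or dev.host in used_hosts:
            continue
        chosen.append(dev)
        used_hosts.add(dev.host)
        if len(chosen) == replica_count:
            break
    return chosen


devices = [
    Device(1, "hostA", "alive", 200),
    Device(2, "hostA", "alive", 250),
    Device(3, "hostB", "alive", 100),
    Device(4, "hostB", "down", 240),
]

originals = pick_devices(devices, CLASS_REPLICAS["original"])
thumbs = pick_devices(devices, CLASS_REPLICAS["thumbnail"])
```

With these four devices, an original lands on one device per host, and a thumbnail lands on a single device.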
Anyway, the framework (MogileFS) knows all about where files are (on what devices), what devices are on what hosts, the status of all devices, etc.
When a disk (device) dies and is marked as dead (manually for now, automatically in the future), the system starts re-replicating all the files that were on that device to other devices.
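Here's a hedged sketch of what that re-replication pass could look like, given a map of which devices hold each file. The function name, the data shapes and the "lowest device id on an unused host" choice are all assumptions for the example. It also only schedules one extra copy per file, which is enough when a single device dies.

```python
def plan_rereplication(file_devs, dev_host, dead_dev, min_copies):
    """Return (file_id, source_dev, target_dev) copy jobs after dead_dev dies.

    file_devs: {file_id: set of dev_ids holding a copy}   (hypothetical shape)
    dev_host:  {dev_id: host name}                         (hypothetical shape)
    """
    jobs = []
    for file_id, devs in sorted(file_devs.items()):
        if dead_dev not in devs:
            continue  # this file never lived on the dead device
        survivors = devs - {dead_dev}
        if not survivors:
            # No copy left to read from. For a derived file this would mean
            # regenerating it; the sketch just reports it with no source.
            jobs.append((file_id, None, None))
            continue
        if len(survivors) >= min_copies:
            continue  # still enough copies elsewhere
        used_hosts = {dev_host[d] for d in survivors}
        candidates = [
            d for d in sorted(dev_host)
            if d != dead_dev and dev_host[d] not in used_hosts
        ]
        target = candidates[0] if candidates else None
        jobs.append((file_id, min(survivors), target))
    return jobs


dev_host = {1: "hostA", 2: "hostA", 3: "hostB", 4: "hostC"}
file_devs = {10: {1, 3}, 11: {2, 3}, 12: {1}}
jobs = plan_rereplication(file_devs, dev_host, dead_dev=1, min_copies=2)
```

In this toy setup, file 10 gets copied from device 3 to device 2, file 11 is untouched, and file 12 (which only had the one copy) shows up with no source to copy from.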
We got the database schema done, as well as a big chunk of the client library. The management tools and background daemons are sketched out in pseudocode and should be pretty easy to finish.
The cool thing about this is that we'll get a lot more logical space out of our 8TB of storage than we would with some naive RAID setup and replication. (Our old plan was for each 4TB machine to run its own RAID 10, cutting it down to 2TB of usable space, and then to have that replicated to the other machine "somehow"....) Now the plan is concrete and we'll get closer to 8TB of storage than 2TB. (Whitaker's class idea really helps, since much of the disk space doesn't need to be replicated....)
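The arithmetic behind that, as a back-of-the-envelope Python sketch. The drive counts and sizes are the ones above; the split between replicated and unreplicated data (half and half, two copies versus one) is a made-up assumption just to show how the class idea moves the number, not a measurement.

```python
MACHINES = 2
DRIVES_PER_MACHINE = 16
DRIVE_GB = 250

raw_gb = MACHINES * DRIVES_PER_MACHINE * DRIVE_GB  # total raw disk

# Old plan: RAID 10 halves each machine, and the second machine is only
# a replica of the first, so usable space is one machine's RAID 10 volume.
old_plan_gb = DRIVES_PER_MACHINE * DRIVE_GB // 2


def usable_gb(raw, class_mix):
    """Logical space for a given mix of (fraction of data, copies kept)."""
    avg_copies = sum(fraction * copies for fraction, copies in class_mix)
    return raw / avg_copies


# Assumed mix: half the logical data is originals kept twice,
# half is derived files kept once.
assumed_mix = [(0.5, 2), (0.5, 1)]
new_plan_gb = usable_gb(raw_gb, assumed_mix)

print(raw_gb, old_plan_gb, round(new_plan_gb))
```

Under that assumed mix the new plan comes out a bit over 5TB of logical space versus 2TB under the old one, and the more of the data that is single-copy, the closer it gets to the full 8TB.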
I'll write more about all this later. We're keeping it totally generic and easy to plug into any other application. (We'll be using it for both FotoBilder and LiveJournal.)
P.S. The reason this is so easy is that the files are immutable. The namespace can change any time, but when it does the old files are marked to be deleted and the new file is replicated around. You can't just go write to the middle of files or anything.
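A tiny Python sketch of that idea: a key in the namespace points at an immutable file id, and "overwriting" a key just means pointing it at a brand-new file and queueing the old one for deletion. The `Namespace` class and its fields are hypothetical names for illustration, not the actual schema.

```python
class Namespace:
    """Toy key -> file id map where files themselves are never modified."""

    def __init__(self):
        self._next_fid = 1
        self.key_to_fid = {}      # current file id for each key
        self.pending_delete = []  # old file ids waiting to be cleaned up

    def store(self, key):
        """Point key at a new file id; mark any previous file for deletion."""
        fid = self._next_fid
        self._next_fid += 1
        old = self.key_to_fid.get(key)
        if old is not None:
            self.pending_delete.append(old)
        self.key_to_fid[key] = fid
        return fid


ns = Namespace()
first = ns.store("user1/pic42")
second = ns.store("user1/pic42")  # "overwrite": new file, old one queued
```

Since no file ever changes after it's written, a replica is either a complete copy or it isn't there, so there's no partial-update state to keep in sync between devices.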