NFS blows, so.... FUSE! - brad's life (LiveJournal)
Brad Fitzpatrick


NFS blows, so.... FUSE! [Jul. 7th, 2004|03:16 pm]

Because NFS blows in more ways than I have fingers, I've started down the road of making MogileFS (our distributed filesystem currently in use on picpix.com/pics.livejournal.com) a true[r] filesystem. NFS seems to not reclaim its nfs_inode_cache (previously discussed here), it has problems with file corruption, it can't export (until v4) hierarchies made of different devices to wildcard hosts, and filehandles go stale.

Using FUSE, I got a dummy filesystem working from Perl, after fixing up the bitrotted Perl bindings for FUSE.

Check this:

bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237655
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237656
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618829
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237657
bini:/home/bradfitz/proj/fuse/mnt# cat 5+34
39
bini:/home/bradfitz/proj/fuse/mnt# cat 5*5
25

There are no files here. The directory listing just shows "Hello_World_at_TIME" and doing a read on any math-looking expression returns the evaluation of that expression.
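
The behaviour in that transcript can be approximated without any FUSE binding at all. What follows is a minimal sketch in Python, not the Perl code behind the transcript: `list_directory` and `read_file` are hypothetical handler names, and the small `ast`-based evaluator is an assumption standing in for whatever evaluated the expressions in the original.

```python
import ast
import operator
import time

# Operators the toy evaluator accepts; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def _eval_node(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("not a simple arithmetic expression")


def list_directory(now=None):
    """Hypothetical readdir-style handler: one made-up entry per call."""
    now = int(time.time()) if now is None else now
    return ["Hello_World_at_%d" % now]


def read_file(name):
    """Hypothetical read-style handler: the file name is the expression."""
    try:
        tree = ast.parse(name, mode="eval")
        value = _eval_node(tree.body)
    except (SyntaxError, ValueError):
        return None  # a real handler would return an errno such as ENOENT
    return ("%s\n" % value).encode()
```

Calling `read_file("5+34")` gives the bytes `39` plus a newline, matching the `cat 5+34` line in the transcript. Nothing is stored anywhere: both results are computed at the moment the kernel asks.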

The way it works is:

-- load the FUSE kernel module
-- start a userspace daemon, which talks to FUSE's device node, telling the kernel to place a FUSE mount somewhere
-- all VFS (virtual filesystem) operations the kernel receives are given a unique number and hurled back at the userspace daemon through the device node
-- the userspace daemon (C, Perl, or Python) has an event loop waiting on the kernel to send it work, and it dispatches subs to handle those calls.
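
The steps above can be sketched as a toy dispatch loop. This is an illustration only: `FakeKernel` is a hypothetical stand-in for the device node, the operation names and handlers are made up, and none of it reflects the real FUSE wire protocol.

```python
import itertools
import queue


def handle_getattr(path):
    # Made-up attributes, in the spirit of the dummy filesystem.
    return {"path": path, "size": 1234}


def handle_read(path):
    return b"data for " + path.encode()


HANDLERS = {"getattr": handle_getattr, "read": handle_read}


class FakeKernel:
    """Stand-in for the FUSE device node; not the real protocol."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.requests = queue.Queue()
        self.replies = {}

    def submit(self, op, path):
        # Every operation gets a unique number before it is handed over.
        req_id = next(self._ids)
        self.requests.put((req_id, op, path))
        return req_id

    def reply(self, req_id, result):
        self.replies[req_id] = result


def serve(kernel):
    """Drain pending requests, dispatching each one to its handler."""
    while True:
        try:
            req_id, op, path = kernel.requests.get_nowait()
        except queue.Empty:
            break
        handler = HANDLERS.get(op)
        result = handler(path) if handler else "ENOSYS"
        kernel.reply(req_id, result)
```

The request number is what ties a reply back to the operation that caused it, so the daemon never has to answer on a particular connection or in a particular order.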

Presumably I can reply out of order, too, so the next step is to rewrite FUSE's event loop and integrate it into Danga::Socket's event loop, or hope it's designed to easily plug into master event loops like Linux::AIO was.
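
If replies really can be matched to requests by number, then slow operations need not block fast ones. A minimal sketch of that idea, using Python's `asyncio` as a stand-in for an event loop such as Danga::Socket (the request numbers and delays are invented for the example):

```python
import asyncio


async def handle(req_id, delay, replies):
    """Pretend to do slow work, then reply tagged with the request number."""
    await asyncio.sleep(delay)
    replies.append(req_id)


async def run_requests(requests):
    """Start every request at once; replies land in completion order."""
    replies = []
    await asyncio.gather(*(handle(rid, delay, replies) for rid, delay in requests))
    return replies


def demo():
    # Request 1 is slow (say, a network fetch); request 2 is fast.
    return asyncio.run(run_requests([(1, 0.05), (2, 0.0)]))
```

Here `demo()` returns `[2, 1]`: the second request finishes first, and its reply goes out without waiting for the first.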

After that, we kill NFS and make a light-weight file serving daemon (think Perlbal but without HTTP header parsing, still doing sendfile and such, so it's quick), and make the FUSE daemon get its files from the network and feed them to the kernel.

It's kinda lame that we're going to be doing so many copies: network in to the FUSE daemon, FUSE daemon to kernel, kernel to user via sendfile. But it shouldn't be too bad.

Fun stuff.

Now, who to recruit to work on this with me? Unfortunately I think everybody has a dozen projects already.

Comments:
From: peter_zaitsev
2004-07-07 10:02 pm (UTC)

What about HA stuff?

That is cool, but how are you going to maintain high availability? Perhaps you want to keep a couple of copies of each file, just in case?

This actually looks like a pretty similar concept to mysqlfs, which once existed,
but that was CORBA-based, so this is perhaps a different technology.

From: brad
2004-07-07 10:26 pm (UTC)

Re: What about HA stuff?

Each file in the system belongs to a class, and each class has a different minimum replica count. The filesystem automatically maintains the minimum replica count for all files, both when they're just created, and when devices die.

I've written about it about a dozen times now, but I haven't put it all in one place yet. Keep an eye out on my journal and I'll post a link to its web page once I make it.
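
The per-class replica rule in this reply can be illustrated with a short sketch. The class names, minimum counts, and data layout below are all hypothetical, chosen only to show the check; they are not MogileFS's actual schema or code.

```python
# Hypothetical classes and their minimum replica counts.
CLASS_MIN_REPLICAS = {"photo": 2, "thumbnail": 1}


def files_needing_replication(files, live_devices):
    """Return {file name: missing copies} for files below their class minimum.

    files maps a file name to (class name, set of device ids holding a copy).
    Copies on devices that are not in live_devices do not count.
    """
    needed = {}
    for name, (cls, devices) in files.items():
        alive = devices & live_devices
        missing = CLASS_MIN_REPLICAS[cls] - len(alive)
        if missing > 0:
            needed[name] = missing
    return needed
```

The same check covers both cases mentioned in the reply: a freshly created file starts with too few copies, and a file that had enough copies drops below its minimum when a device leaves `live_devices`.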