NFS blows, so.... FUSE! - brad's life - LiveJournal
Brad Fitzpatrick

[ website | bradfitz.com ]

NFS blows, so.... FUSE! [Jul. 7th, 2004|03:16 pm]

Because NFS blows in more ways than I have fingers, I've started down the road of making MogileFS (our distributed filesystem currently in use on picpix.com/pics.livejournal.com) a true[r] filesystem. NFS seems to not reclaim its nfs_inode_cache (previously discussed here), it has problems with file corruption, it can't export (until v4) hierarchies made of different devices to wildcard hosts, and filehandles go stale.

Using FUSE, I got a dummy filesystem working from Perl, after fixing up the bitrotted Perl bindings for FUSE.

Check this:

bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237655
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237656
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618829
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237657
bini:/home/bradfitz/proj/fuse/mnt# cat 5+34
39
bini:/home/bradfitz/proj/fuse/mnt# cat 5*5
25

There are no files here. The directory listing just shows "Hello_World_at_TIME" and doing a read on any math-looking expression returns the evaluation of that expression.
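
A minimal sketch of the two behaviours shown above, in Python rather than the Perl used in the post and without any FUSE bindings. The function names list_directory and read_file are hypothetical, and the trailing newline on the result is an assumption based on the cat output:

```python
import ast
import operator
import time

# Only plain integer arithmetic is accepted; anything else is "no such file".
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}

def _eval(node):
    """Evaluate a tiny arithmetic AST made of integers and +, -, *."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("not a math-looking name")

def list_directory(now=None):
    """Directory listing: one synthetic entry whose name embeds the time."""
    now = int(time.time()) if now is None else int(now)
    return ["Hello_World_at_%d" % now]

def read_file(name):
    """Contents of a 'file': the value of the expression in its name, or None."""
    try:
        tree = ast.parse(name, mode="eval")
        return "%d\n" % _eval(tree.body)
    except (SyntaxError, ValueError):
        return None
```

A real filesystem would register functions like these as the directory-listing and read callbacks of whatever FUSE binding is in use; here they are plain functions so the logic can be run on its own.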

The way it works is:

-- load the FUSE kernel module
-- start a userspace daemon, which talks to FUSE's device node, telling the kernel to place a FUSE mount somewhere
-- all VFS (virtual filesystem) operations the kernel receives are given a unique number and hurled back at the userspace daemon through the device node
-- the userspace daemon (C, Perl, or Python) has an event loop waiting on the kernel to send it work, and it dispatches subs to handle those calls.
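
The steps above can be sketched as a toy dispatch loop. This is only an illustration under stated assumptions: the kernel and device node are replaced by an in-memory queue, and the operation names, the handlers and the "ENOSYS" marker are all made up for the example rather than taken from FUSE:

```python
from collections import deque

def handle_getdir(path):
    # Hypothetical handler: pretend every directory holds one entry.
    return ["Hello_World"]

def handle_read(path):
    # Hypothetical handler: return some bytes for any path.
    return "contents of " + path

# Operation name -> handler, like the subs a Perl daemon would register.
HANDLERS = {"getdir": handle_getdir, "read": handle_read}

def serve(requests):
    """Drain queued (request_id, op, path) tuples; return {request_id: reply}."""
    replies = {}
    while requests:
        request_id, op, path = requests.popleft()
        handler = HANDLERS.get(op)
        # Unknown operations get an error marker instead of crashing the loop.
        replies[request_id] = handler(path) if handler else "ENOSYS"
    return replies

# Stand-in for work arriving from the kernel, each item tagged with its number.
queue = deque([(1, "getdir", "/"), (2, "read", "/5*5"), (3, "chmod", "/x")])
replies = serve(queue)
```

Because every reply is keyed by the request's number, nothing in this shape forces replies to go back in arrival order.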

Presumably I can reply out of order, too, so the next step is to rewrite FUSE's event loop and integrate it into Danga::Socket's event loop, or hope it's designed to easily plug into master event loops like Linux::AIO was.
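
As a rough picture of what plugging into a master event loop means, here is a sketch using Python's selectors module. A socketpair stands in for the FUSE device node, and the one-line "<id> <op> <path>" wire format is invented for the example; neither reflects how FUSE or Danga::Socket actually work:

```python
import selectors
import socket

def run_once(sel):
    """One turn of the shared loop: call the callback of every ready fd."""
    handled = 0
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)  # key.data holds the callback registered below
        handled += 1
    return handled

def make_fs_callback(seen):
    def on_readable(conn):
        # Hypothetical wire format: "<id> <op> <path>\n", one request per recv.
        request_id, op, path = conn.recv(4096).decode().split()
        seen.append((int(request_id), op, path))
    return on_readable

sel = selectors.DefaultSelector()
kernel_side, daemon_side = socket.socketpair()  # stand-in for the device node
seen = []

# The filesystem is just one more readable fd in the loop, next to any sockets.
sel.register(daemon_side, selectors.EVENT_READ, make_fs_callback(seen))

kernel_side.sendall(b"7 read /5*5\n")
handled = run_once(sel)

sel.unregister(daemon_side)
kernel_side.close()
daemon_side.close()
```

The point of the design is that the filesystem never owns the loop: the same select call can also watch network sockets, which is what a file-serving daemon would need.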

After that, we kill NFS and make a light-weight file serving daemon (think Perlbal but without HTTP header parsing, still doing sendfile and such, so it's quick), and make the FUSE daemon get its files from the network and feed them to the kernel.

It's kinda lame that we're going to be doing so many copies: network in to the FUSE daemon, FUSE daemon to kernel, kernel to user via sendfile. But it shouldn't be too bad.

Fun stuff.

Now, who to recruit to work on this with me? Unfortunately I think everybody has a dozen projects already.

Comments:
From: brad
2004-07-07 05:19 pm (UTC)
NFS works okay most of the time. It's the corner cases and special setups where it starts to suck.

I think NFS is hated universally.
From: snej
2004-07-07 05:22 pm (UTC)
Oh, I already know plenty of reasons to hate it. Just hadn't heard of its being flat-out unreliable.
From: brad
2004-07-07 05:30 pm (UTC)
Stale mounts and export limitations I could deal with.

It crossed the fucking line when it started corrupting data with pages of zero bytes.
From: xaosenkosmos
2004-07-07 08:29 pm (UTC)
*sticks fingers in ears* lalalalala!!! I can't hear you!!

The zero-byte crap makes me very nervous. We haven't seen it (as far as we know), but it makes me afraid. Have you done any work to isolate/identify the problem, or are you guys just moving along and trying to avoid it?
From: brad
2004-07-07 09:40 pm (UTC)
I think it happen{ed|s} only under load, when there wasn't free memory to allocate for that NFS page. But what should've happened is an error, not a silent corruption. We rebooted the offending client with a kernel upgrade and it "fixed" it, but I'm not sure whether that was a result of the kernel upgrade, the corrupted caches being reset, or more memory being available.