brad's life
Brad Fitzpatrick

NFS blows, so.... FUSE! [Jul. 7th, 2004|03:16 pm]

Because NFS blows in more ways than I have fingers, I've started down the road of making MogileFS (our distributed filesystem currently in use on picpix.com/pics.livejournal.com) a true[r] filesystem. NFS seems to not reclaim its nfs_inode_cache (previously discussed here), it has problems with file corruption, it can't export (until v4) hierarchies made of different devices to wildcard hosts, and filehandles go stale.

Using FUSE, I got a dummy filesystem working from Perl, after fixing up the bitrotted Perl bindings for FUSE.

Check this:

bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237655
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618828
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237656
bini:/home/bradfitz/proj/fuse/mnt# ls -l
total 544618829
-rwxr-xr-x 1 bradfitz bradfitz 1234 Jul 7 2004 Hello_World_at_1089237657
bini:/home/bradfitz/proj/fuse/mnt# cat 5+34
bini:/home/bradfitz/proj/fuse/mnt# cat 5*5

There are no files here. The directory listing just shows "Hello_World_at_TIME" and doing a read on any math-looking expression returns the evaluation of that expression.
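
A rough Python sketch of those two behaviors, for illustration only. The actual test script used the Perl FUSE bindings and isn't shown here; the function names `list_dir` and `read_file` are made up, and a real handler would receive full paths from the kernel rather than bare names.

```python
import ast
import operator
import time

# Arithmetic operators the fake "files" understand (an assumption for this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def list_dir(now=None):
    """Return the fake directory listing: a single name embedding the current time."""
    now = int(time.time()) if now is None else int(now)
    return ["Hello_World_at_%d" % now]


def _eval(node):
    """Evaluate a parsed expression made only of numbers and basic arithmetic."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("not a math expression")


def read_file(name):
    """Treat the file name as arithmetic and return the result as file contents.

    Returns None when the name is not a math expression; a real filesystem
    handler would turn that into a "no such file" error.
    """
    try:
        value = _eval(ast.parse(name, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None
    return ("%s\n" % value).encode()
```

Walking the parsed expression instead of calling `eval` keeps arbitrary file names from running arbitrary code, which matters once any process on the box can `cat` a name of its choosing.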

The way it works is:

-- load the FUSE kernel module
-- start a userspace daemon, which talks to FUSE's device node, telling the kernel to place a FUSE mount somewhere
-- all VFS (virtual filesystem) operations the kernel receives are given a unique number and hurled back at the userspace daemon through the device node
-- the userspace daemon (C, Perl, or Python) has an event loop waiting on the kernel to send it work, and it dispatches subs to handle those calls.
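
The steps above can be sketched as a toy simulation in Python. This does not talk to the real FUSE device node: `FakeKernel`, `run_daemon`, and the `"ENOSYS"` string are stand-ins invented for the sketch. It only shows the shape of the loop, with numbered requests coming in, a handler table, and replies going back tagged with the same number.

```python
import itertools
from collections import deque


class FakeKernel:
    """Stand-in for the device node: hands out numbered requests, collects replies."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = deque()
        self.replies = {}

    def submit(self, op, *args):
        """Queue a VFS-style operation and return its unique request number."""
        req_id = next(self._ids)
        self.pending.append((req_id, op, args))
        return req_id

    def next_request(self):
        """Return the oldest queued request, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def reply(self, req_id, result):
        """Record the daemon's answer under the request number it belongs to."""
        self.replies[req_id] = result


def run_daemon(kernel, handlers):
    """Event loop: pull requests, dispatch by operation name, send replies."""
    while True:
        request = kernel.next_request()
        if request is None:
            break  # a real daemon would block waiting on the device instead
        req_id, op, args = request
        handler = handlers.get(op)
        result = handler(*args) if handler else "ENOSYS"
        kernel.reply(req_id, result)


# Hypothetical handlers for two operations.
handlers = {
    "readdir": lambda path: ["Hello_World"],
    "read": lambda path: b"1234",
}

kernel = FakeKernel()
listing_id = kernel.submit("readdir", "/")
unlink_id = kernel.submit("unlink", "/Hello_World")
run_daemon(kernel, handlers)
```

Because every reply carries the request number, nothing in this shape forces the daemon to answer in arrival order.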

Presumably I can reply out of order, too, so the next step is to rewrite FUSE's event loop and integrate it into Danga::Socket's event loop, or hope it's designed to easily plug into master event loops like Linux::AIO was.

After that, we kill NFS and make a light-weight file serving daemon (think Perlbal but without HTTP header parsing, still doing sendfile and such, so it's quick), and make the FUSE daemon get its files from the network and feed them to the kernel.

It's kinda lame that we're going to be doing so many copies: network in to the FUSE daemon, FUSE daemon to kernel, kernel to user via sendfile. But it shouldn't be too bad.

Fun stuff.

Now, who to recruit to work on this with me? Unfortunately I think everybody has a dozen projects already.

From: whitaker
2004-07-07 03:23 pm (UTC)
Sorry I didn't get to playing with that sooner. I'm really interested, just don't have enough time. :-/

From: brad
2004-07-07 03:28 pm (UTC)
All good. It's more important you work on your current projects. There remains a possibility we'll get NFS to not be a bitch without doing any work, or at worst we can just tolerate it for a few months and do ugly kludges to make it work, but there's no fairy godmother that's going to come and do all your work for free.

From: mart
2004-07-07 03:36 pm (UTC)

This FUSE thing sounds like fun. I might have to take a look at it myself… sometime.

How many wacky daemons does Danga have now? I remember memcached (obviously), ddlockd, mogilefsd, mailgated… does currently-available free software suck so much that everything has to be made from scratch, or are you doing it for fun?

It's weird that LJ is the first web application that needed all this stuff. Cool that you guys are the ones making it, though.


From: brad
2004-07-07 03:55 pm (UTC)
perlbal: mod_proxy kinda sucked. not very flexible, either.

memcached: nothing existed.

ddlockd: couple options existed, but one didn't quite do what i wanted, and another wasn't open source at the time, and was incredibly large and overkill. plus ddlockd is tiny.

mailgated: just mailgate running forever, unspooling stuff.

phonepostd: same as mailgated pretty much. they should be merged at some point in time.


From: scsi
2004-07-07 03:57 pm (UTC)
Eh, NFS v4 over TCP might blow as well, plus it's still experimental.. NFS should die a horrible death.. :( Too bad there weren't any good iSCSI open source implementations..

From: brad
2004-07-07 03:59 pm (UTC)
There are a dozen iSCSI projects. Both targets and initiators.

But that just gives you cheap, remote block devices.

The missing link then is GFS, which Red Hat recently open sourced. I want to play with it but I can't use it and trust people's data to it if I don't fully understand how to use it, manage it, fix it, back it up, etc.

From: snej
2004-07-07 05:08 pm (UTC)
I don't get it. What are the other zillions of Linux-based intranets in the world using for file sharing if NFS is so unusable?

From: brad
2004-07-07 05:19 pm (UTC)
NFS works okay most of the time. It's the corner cases and special setups where it starts to suck.

I think NFS is hated universally.

From: snej
2004-07-07 05:22 pm (UTC)
Oh, I already know plenty of reasons to hate it. Just hadn't heard of its being flat-out unreliable.

From: brad
2004-07-07 05:30 pm (UTC)
Stale mounts and export limitations I could deal with.

It crossed the fucking line when it started corrupting data with pages of zero bytes.

From: xaosenkosmos
2004-07-07 08:29 pm (UTC)
*sticks fingers in ears* lalalalala!!! I can't hear you!!

The zero-byte crap makes me very nervous. We haven't seen it (as far as we know), but it makes me afraid. Have you done any work to isolate/identify the problem, or are you guys just moving along and trying to avoid it?

From: brad
2004-07-07 09:40 pm (UTC)
I think it happen{ed|s} only during load when there wasn't free memory to allocate for that NFS page. But what should've happened is an error, not a silent corruption. We rebooted the offending client with a kernel upgrade and it "fixed" it, but I'm not sure it's a result of the kernel upgrade or the corrupted caches being reset, or more memory being available.

From: visions
2004-07-07 09:11 pm (UTC)
an unrelated question, but what company are you guys using for your rackmount machines now? i did a search, but didn't come up with the name. i need to pick up some rack mount machines in the near future, so i am looking for a good vendor with decent prices.

to stay on topic a bit.. you might find these useful..
general overview of gfs
datasharing for webservers

again, not super technical, but it might give you enough of an overview to get started with gfs and move forward using it.

From: brad
2004-07-07 09:39 pm (UTC)
I think I've read both those. I think I understand enough to get moving and start testing it, but I've just lacked the time so far.

Reading the GFS USENIX papers currently.

From: visions
2004-07-07 09:49 pm (UTC)
any comment as to what rackmount vendor you guys are using or recommend these days? you can email it to me privately if there is a disclosure issue.

From: brad
2004-07-07 09:50 pm (UTC)
Oh, forgot that part.

Silicon Mechanics! http://www.siliconmechanics.com/

Totally fucking rock. I don't even look at other vendors anymore, I'm so happy.

From: peter_zaitsev
2004-07-07 10:02 pm (UTC)

What is about HA stuff

That is cool, but how are you going to maintain high availability? Perhaps you want to keep a couple of copies of each file just in case?

This actually looks like a pretty similar concept to mysqlfs, which once existed, but that was CORBA-based, so this is perhaps a different technology.


From: brad
2004-07-07 10:26 pm (UTC)

Re: What is about HA stuff

Each file in the system belongs to a class, and each class has a different minimum replica count. The filesystem automatically maintains the minimum replica count for all files, both when they're just created, and when devices die.

I've written about it about a dozen times now, but I haven't put it all in one place yet. Keep an eye out on my journal and I'll post a link to its web page once I make it.
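
The minimum-replica rule described above could be checked with something like the following Python sketch. It is an illustration only, not MogileFS code: the function name, the class names, and the data layout (each file mapped to its class and the set of devices holding a copy) are all assumptions.

```python
def replication_deficits(files, class_min_replicas, dead_devices=frozenset()):
    """Return {file_key: copies_still_needed} for files below their class minimum.

    files: {file_key: (class_name, set_of_device_ids_holding_a_copy)}
    class_min_replicas: {class_name: minimum_replica_count}
    dead_devices: device ids whose copies no longer count
    """
    deficits = {}
    for key, (file_class, devices) in files.items():
        live_copies = set(devices) - set(dead_devices)
        missing = class_min_replicas[file_class] - len(live_copies)
        if missing > 0:
            deficits[key] = missing
    return deficits


# Hypothetical classes and devices.
mins = {"userpic": 2, "thumbnail": 1}
files = {
    "a.jpg": ("userpic", {"dev1", "dev2"}),
    "b.jpg": ("thumbnail", {"dev1"}),
    "c.jpg": ("userpic", {"dev2"}),  # just created, only one copy so far
}
```

Running the same check for a newly created file and after a device is marked dead covers both cases mentioned: anything that comes back with a deficit needs that many more copies made.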

From: taral
2004-07-13 05:23 pm (UTC)
Try Plan 9 if you want to see something that approaches "filesystems done right" a bit closer than any UNIX system.