today's rage #1: portable advisory file locking
[May. 27th, 2004|09:55 pm]
So at work we made this little network lock daemon, and a client that's capable of using them all (or all that are alive) to take out locks. Tiny, tiny, works great.
So then we're like, "Oh, we need a dumb fallback method for people on single hosts. Let's just do flock or fcntl or lockf or something.... that'll be easy...."
Either we're all crazy, or flock/fcntl/lockf are a total pain in the ass. Our lock stress tester (10 forked children fighting for a few seconds over a small number of locks) works perfectly with our network lock client/daemon..... the synchronized code (an O_CREAT|O_EXCL open + unlink) never fails.
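For readers who haven't seen the trick, here is a rough sketch of what such a "synchronized code" check might look like. This is not the actual stress tester; the function name and marker path are made up for illustration. The idea is that an O_CREAT|O_EXCL open fails if the file already exists, so if two lock holders are ever inside the critical section at once, one of them notices.

```python
import os

def critical_section(marker_path):
    """Return True if we were alone in the critical section.

    marker_path is a hypothetical scratch file; the caller is assumed
    to already hold whatever lock is being tested.
    """
    try:
        # O_EXCL makes the create fail if someone else's marker exists.
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False  # another process is in here too: the lock failed
    os.close(fd)
    os.unlink(marker_path)  # leave the section clean for the next holder
    return True
```

A stress test would fork a handful of children, have each take the lock, call this, and count how many times it returns False.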
But once we switch to using local-machine locking, reliability goes to hell.
What's wrong here?
It seems both fcntl and flock locks are released at different times, but I can't get it right. It only works if I comment out the line after BIZARRE in the source above. (never unlinking the file I flock)
Where's the race?
2004-05-27 10:30 pm (UTC)
2004-05-27 11:04 pm (UTC)
Apparently everybody didn't. Lesson learned.
2004-05-28 12:24 am (UTC)
Wait, does link() work on Windows? I doubt it.
We have a number of LJ developers on Windows I don't want to isolate. Suppose I could use some Win32 semaphore functions over there, though.
2004-05-28 12:34 am (UTC)
I did briefly look at some of the lockd stuff you're working on, I think, but I don't remember the actual semantics. If the goal is to provide your locks in terms of names, you could certainly just use CreateMutex and provide a name à la "Local\\LockD\\". You wanna be really careful to close those objects when you're done with them, though. Especially if 9x is a target.
2004-05-28 12:42 am (UTC)
The lockd daemon stuff already works. Everywhere.
I'm talking about the single machine case without the lock server. Can you create (on Windows) a mutex that dies when the process dies?
2004-05-28 12:48 am (UTC)
Dies in what sense? Loses its lock, or the object is destroyed?
The process handle actually acts pretty much like this already. It goes into a signalled state when the program exits, which causes it to trigger on a WaitFor* function.
How you get that process handle kinda depends on what exactly you're doing, I suppose. CreateProcess returns it (but I don't think there's any easy way to get the clib wrappers to get it), if you're doing the spawning.
uhm- run lockd on localhost?
i'm probably missing something :P
2004-05-28 12:57 am (UTC)
I suspect the goal is to avoid people on small setups having to run Yet Another Daemon, and one that's kind of redundant on a single machine setup as well.
2004-05-28 12:58 am (UTC)
2004-05-28 05:25 am (UTC)
Given that running LJ on Windows is already a bit of a pain in the butt, I doubt most of your Win32-based developers actually run the server on Windows. I personally used to run it in VMWare on a linux install and now I have a linux box.
alarm() calls in S2.pm always trip me up, though, because S2 other than that works fine alone on Win32. I've always kinda thought it'd be cleaner for the alarm() stuff to be done by the caller anyway… so that different S2 applications can have different timeouts or no timeout.
SysV semaphores are what you should be using on POSIX-compliant systems anyway, and the code will port better to Windows.
jwz's solution is nice... if it were still 1995. But SysV IPC actually works everywhere now. It's the Right tool for this job.
2004-05-28 12:19 pm (UTC)
Re: File locking bad. Semaphores good.
In what way is this "right tool" superior to the way I did it? Because if the old way still works, I don't see any reason to switch to a new way (which is not backward compatible.)
Because your old way comes with all these caveats of "don't do this" as you pointed out very carefully in the document you referenced. Semaphores work and, if they fail to work, it's an OS bug and you get to yell at the vendor. Nobody really gives a shit if you yell about flock being broken these days.
I knew that using link() was the One True Way™ when you're trying to lock something over NFS, but didn't know that flock()/lockf() were fundamentally borked for real filesystems as well (at least under both Linux and Darwin). Good to know.
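For anyone following along, the link() approach can be sketched roughly like this. It is a minimal illustration, not the code under discussion; the function names and the temp-file naming are assumptions. Create a uniquely named file, try to hard-link it to the lock name, and treat the lock as taken if the unique file ends up with a link count of 2.

```python
import os

def acquire_link_lock(lock_path):
    """Try to take the lock; return True on success, False if it is held."""
    # Hypothetical naming scheme: one scratch file per process.
    tmp = f"{lock_path}.{os.getpid()}.tmp"
    fd = os.open(tmp, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    os.close(fd)
    try:
        os.link(tmp, lock_path)  # atomic: fails if lock_path exists
    except FileExistsError:
        pass  # fall through to the link-count check
    # Over NFS the return value of link() can't always be trusted,
    # so the link count on our own file is the real answer.
    got_it = os.stat(tmp).st_nlink == 2
    os.unlink(tmp)
    return got_it

def release_link_lock(lock_path):
    os.unlink(lock_path)
```

Note that a lock taken this way does not go away if the holder crashes, so some stale-lock policy is still needed.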
A: create lock file
B: open lock file
A: lock lock file
B: waiting on lock...
A: unlock lock file
A: remove lock file
C: create lock file
C: lock lock file
B: lock acquired
Welcome to race conditions.
Except that we're flock()ing with LOCK_EX|LOCK_NB, and returning a failure on EWOULDBLOCK.
You're right! So it leaks file descriptors... lots of them apparently.
Fixing up all the file descriptor leaks makes it stop failing. Oddly enough, the critical line is "close(file_fd);".
A: create lock
A: lock lock
B: open lock
A: unlink lock
C: create lock
A: unlock lock
B: acquire lock
C: acquire lock
Still a race.
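One common way around the race traced above, sketched here under the assumption that unlinking the lock file is a requirement: after getting the flock, check that the file you locked is still the file the path points to, and retry if it isn't. The names are illustrative and this is not the code from the post.

```python
import fcntl
import os

def try_lock(path):
    """Return an fd holding the lock, or None if someone else has it."""
    while True:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:  # EWOULDBLOCK: lock is held elsewhere
            os.close(fd)
            return None
        try:
            on_disk = os.stat(path)
        except FileNotFoundError:
            os.close(fd)  # previous holder unlinked it under us; retry
            continue
        ours = os.fstat(fd)
        if (on_disk.st_dev, on_disk.st_ino) == (ours.st_dev, ours.st_ino):
            return fd  # we locked the file that the path still names
        os.close(fd)  # we locked a stale, already-unlinked file; retry

def unlock(path, fd):
    os.unlink(path)  # unlink while still holding the lock
    os.close(fd)     # closing the descriptor releases the flock
```

The unlink happens before the close, so anyone who opened the old file and gets its lock afterwards will see the mismatch and start over instead of believing they hold the lock.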
2004-05-28 04:20 pm (UTC)
You may wish to consider lock() = mkdir(); unlock() = rmdir() for portability. On all modern systems I'm aware of mkdir() is atomic (this wasn't the case in some ancient systems though), as is rmdir(), and you can test for the presence of the lock (directory) with stat(). To the best of my knowledge it's also distributed-file-system (eg, NFS) safe where flock/fcntl may or may not be. (jwz's suggestion of link() is also useful in a Unix/POSIX environment, but may be hard to emulate under Win32, etc, as you note in the comments.)
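A minimal sketch of the mkdir()/rmdir() idea, with made-up function names:

```python
import os

def lock(path):
    """Take the lock by creating a directory; False if it already exists."""
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        return False

def is_locked(path):
    """Check for the lock directory without trying to take it."""
    return os.path.isdir(path)

def unlock(path):
    os.rmdir(path)
```

Unlike flock(), a lock taken this way is not dropped when the holding process dies, so a crashed holder leaves the directory behind and something has to clean it up.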