2007-07-04 02:14 am (UTC)
Didn't Linux just get them only recently? Which means I'd have to implement something else for portability anyway, so not quite worth it.
There's no simple fix to this problem. The correct solution is a single-producer (epoll waiter)/multiple-consumer model with a single queue for transmission of work to the workers.
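A minimal sketch of that model, using Python threads and a Queue to stand in for the epoll waiter and the worker pool (the function names and the echo handler are mine, not from the original):

```python
import queue
import socket
import threading

def run_server(listen_sock, num_workers=4):
    """Single producer accepts; workers consume from one shared queue."""
    work = queue.Queue()

    def producer():
        # Only this thread ever blocks on the listening socket,
        # so a new connection wakes exactly one thread: no herd.
        while True:
            conn, _ = listen_sock.accept()
            work.put(conn)

    def worker():
        while True:
            conn = work.get()
            conn.sendall(conn.recv(1024))  # toy handler: echo one message
            conn.close()

    threading.Thread(target=producer, daemon=True).start()
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
```

In a real server the producer would sit in epoll_wait/kevent over many fds and enqueue ready events rather than accepted connections, but the shape is the same.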
2007-07-04 02:11 am (UTC)
Isn't that what I described?
2007-07-04 02:13 am (UTC)
As long as I have a socket to the child anyway, I don't gain anything that I can see from having the child do the accept.
Either way I have to:
1) wait for readability
2) notify child
It might even be better to have the child do the accept, since that spreads the work across multiple CPUs.
With fd passing, I'm just introducing portability and dependency problems.
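For reference, the fd passing being dismissed here looks roughly like this: a sketch using socket.send_fds/recv_fds (which need Python 3.9+; the helper names are mine):

```python
import socket

def pass_fd(channel, conn):
    # Parent side: send the accepted connection's fd over a UNIX-domain
    # socketpair as SCM_RIGHTS ancillary data (one dummy byte of payload).
    socket.send_fds(channel, [b"x"], [conn.fileno()])

def receive_fd(channel):
    # Child side: receive the fd and wrap a new socket object around it.
    msg, fds, flags, addr = socket.recv_fds(channel, 1, 1)
    return socket.socket(fileno=fds[0])
```

The underlying sendmsg/SCM_RIGHTS dance is exactly the portability sore spot the comment is objecting to.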
How about not having every child put that listening socket fd in its select/epoll/queue set?
Instead, have a token (or a few of them) that gets passed among the child processes: if a child holds the token, it puts the fd in its select/epoll/etc. set; otherwise it doesn't. Once it accepts a new connection, it passes the token to the "next" child process. That at least lets you control the size of the herd, which will equal the number of tokens in circulation.
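The token scheme amounts to a counted accept lock. A sketch with threads, where a semaphore's N permits stand in for the N circulating tokens (names mine):

```python
import socket
import threading

def run_token_pool(listen_sock, num_workers=4, num_tokens=1):
    # N permits = N tokens in circulation = at most N children
    # watching the listening fd, so at most N wake per connection.
    tokens = threading.Semaphore(num_tokens)

    def worker():
        while True:
            tokens.acquire()                # take a token
            try:
                conn, _ = listen_sock.accept()
            finally:
                tokens.release()            # pass the token to the next child
            conn.sendall(conn.recv(1024))   # toy handler: echo
            conn.close()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
```

With processes instead of threads the token would travel over a pipe or live in a shared semaphore, but the gating is the same.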
2007-07-04 03:02 am (UTC)
You're overblowing it
Thundering herd is an issue when you have a big herd (e.g. in Apache's 500 children situation).
How many children are you talking about here? In Qpsmtpd I basically launch one per CPU, and while yes, 4 processes get notified about readability, it's really not a big deal.
I did try the send_fd/read_fd method, and the pipe to children method, but neither really helped any as far as I could see.
Perhaps it's time to do some real world testing to see if it's really an issue for your app.
Feels kinda lame that I have to do this.
Any better suggestions?
Yes: stop using threads and start using coroutines.
2007-07-04 09:22 am (UTC)
Welcome to the future where everybody has like 8 cores at least.
The thundering herd problem keeps coming up every few years. Fun recent thread. Kevents! Event lossage! Blah blah blah.
The OS should just pick one thread, and the syscall shouldn't dequeue an event until it's been fully served. But they don't?
Otherwise, almost everything you try to do to get around it is the same amount of work anyway: some sort of listener/queue relationship.
2007-07-04 06:02 am (UTC)
LJ via iPhone. *giggles*
Hey! I'm drunk. That's a valid excuse. Fortunately, it (the phone, yo) fixes most of my typing mistakes.
I've got yer thundering hurd right here, bitch.
2007-07-04 07:19 am (UTC)
Are you sure you actually have a problem here? E.g., do you have actual strace output showing multiple child processes being woken up from accept() when a single incoming connection comes in? Or are you merely speculating that if this happened it would be bad?
I ask because one of the early "competitive benchmarks" of Linux/Apache and Windows/IIS uncovered this problem, around 1999 from what I can tell with a bit of googling, and a fix went into Linux 2.3 IIRC. (The fix being: pick one of the processes to wake up, and skip the rest.) So any vaguely modern Linux kernel shouldn't do that.
(the latter being a 2005 book which reports what the fix was -- search for "thundering herd" in the PDF -- but I'm pretty sure it'd been in the Linux kernel for years by the time that book came out)
IIRC *BSD followed suit with a similar fix, and Solaris and some of the other high performance unixes had a similar fix even before that.
So I'd suggest confirming you actually do have a problem before getting too carried away trying to fix it in userland...
2007-07-04 07:23 am (UTC)
Re: Thundering herd
I also thought this problem had been solved in the kernel.
Apache uses accept serialization via a mutex,
so only one process/thread blocks in the accept call;
the others block on the mutex.
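A sketch of that serialization with a preforked pool, using multiprocessing (this assumes the Unix fork start method so the listening socket is inherited by the children; names are mine):

```python
import multiprocessing
import socket

def serve(listen_sock, accept_lock):
    # Children serialize on a shared lock: exactly one process at a
    # time blocks in accept(); the rest block here on the mutex.
    while True:
        with accept_lock:
            conn, _ = listen_sock.accept()
        conn.sendall(conn.recv(1024))       # toy handler: echo
        conn.close()

def start_pool(listen_sock, num_children=4):
    lock = multiprocessing.Lock()
    for _ in range(num_children):
        multiprocessing.Process(target=serve, args=(listen_sock, lock),
                                daemon=True).start()
```

Only the lock holder can be woken by a new connection, so the kernel's wakeup behavior stops mattering.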
2007-07-04 10:44 am (UTC)
We actually moved to the model you described in SpamAssassin 3.1.0.
Previously we did the "preforked pool of servers all doing blocking accept"
thing, but that didn't allow scaling of the pool size to deal with demand. So
instead, I sat down with some of the Apache httpd guys, went over how Apache 2
preforking and pool scaling works, and came up with a pure-perl implementation.
That's what we now use. Here are the key bits:
- The parent shares the accept fd with the kids, and keeps a socketpair between itself and each child, as you describe.
- The parent maintains the pool so it always includes N idle children, scaling the number of children up/down as the number of idle kids changes with load (Apache-style preforking).
- The parent selects on the accept socket.
- When a new incoming connection appears, it picks the lowest-numbered idle child and orders it to accept.
- The children report their state ("idle", "busy") back over the socketpair as they become idle or are about to accept a connection.
- A child waits for orders from the parent. When ordered to accept, it reports itself "busy", accepts the connection, deals with it, then reports itself "idle" again.
Note, btw, the use of lowest-numbered idle child; that's an easy way to keep
the same kid processes "hot". Apache httpd does the same thing (iirc). Since
the communication is therefore generally between processes that are swapped in,
and no swapping is required, this was a key benefit that makes this a little
faster than the traditional "preforked blocking accept" style, at least for
most casual users. (Of course a well monitored setup where the admin is
careful to ensure swap is never hit would probably be more efficient using
the traditional "blocking accept" model, so we still offer that; but most
people aren't that careful.)
We had a nasty bug that went on for a while on some loaded servers, but
eventually we got it fixed (deleting an entry from a hash in a SIGCHLD signal
handler is unsafe, doh!). Nowadays it seems to be quite stable.
The code's in our
SVN: lib/Mail/SpamAssassin/SpamdForkScaling.pm and
Out of curiosity, are you using EPOLLET?
Don't worry about it? Assuming one child per CPU, the herd is small.
Only idle children will wake up. Those doing work will miss the event as it will be handled before they enter the event loop.
An idle child on a non-idle CPU might not wake up for the same reason.
You are adding complexity to optimize an idle system at the expense of a non-idle system.
2007-07-08 06:49 pm (UTC)
The herd (the number of CPUs in a typical box) is currently small, but for how long? Already 8 CPU boxes are typical in our data center. And next year? I'd rather just do it right once and not worry about it in a few years.