Log in

No account? Create an account
thundering herd; bleh - brad's life — LiveJournal [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

thundering herd; bleh [Jul. 3rd, 2007|06:06 pm]
Brad Fitzpatrick
[Tags|, , ]

Imagine the following scenario:
  • Parent process opens listening socket, but will never accept() on it. Opening in parent so future forked child processes will inherit the fd, and because it's a low port (perhaps) and we want to open it before we drop root.
  • Child processes (one per CPU) inherit listening fd, but they're event-based, handling many connections each, so they can't do a blocking accept() on it...
See where this is going? :-)
  • So if each child has readability of that listening socket fd in its select/epoll/queue set, guess what? Thundering herd. The kernel wakes up all processes, but only one accepts the new client connection.
So instead I guess I have to...
  • have a socketpair between parent and each child for communication,
  • watch readability of the listening fd in parent process,
  • on readability, send a message to "best" child (for some definition of best) to accept on the listening socket which it has, but isn't watching readability on.
  • perhaps have children tell the parent process when connections go away, so parent knows approximate load on each child.
Feels kinda lame that I have to do this.

Any better suggestions?

[User Picture]From: taral
2007-07-04 01:48 am (UTC)
Message queues?
(Reply) (Thread)
[User Picture]From: brad
2007-07-04 02:14 am (UTC)
Didn't Linux just get them only recently? Which means I'd have to implement something else for portability anyway, so not quite worth it.
(Reply) (Parent) (Thread) (Expand)
[User Picture]From: taral
2007-07-04 01:54 am (UTC)
There's no simple fix to this problem. The correct solution is a single-producer (epoll waiter)/multiple-consumer model with a single queue for transmission of work to the workers.
(Reply) (Thread)
[User Picture]From: brad
2007-07-04 02:11 am (UTC)
Isn't that what I described?
(Reply) (Parent) (Thread) (Expand)
[User Picture]From: crucially
2007-07-04 02:09 am (UTC)
Accept in the parent and pass the fd over to the child that should have it?

Something like http://search.cpan.org/~addi/File-FDpasser-0.09/ does it.

(Reply) (Thread)
[User Picture]From: brad
2007-07-04 02:13 am (UTC)
As long as I have a socket to the child anyway, I don't gain anything that I can see from having the child do the accept.

Either way I have to:

1) wait for readability
2) notify child
3) accept

It might even be better to have the child accept, as that's
spread out between multiple CPUs.

With fd passing, I'm just introducing portability and dependency problems.

(Reply) (Parent) (Thread)
(Deleted comment)
(Deleted comment)
From: ext_53623
2007-07-04 03:01 am (UTC)

trim the herd

How about not having every child put that listening socket fd in its select/epoll/queue set?

Instead have a token (or a few of them) that get passed among the child processes; if a child has the token then they put the fd in their select/epoll/etc., otherwise they don't. Once they accept a new connect they pass the token to the "next" child process. That at least lets you control the size of the herd, which will be equal to the number of tokens in circulation.

(Reply) (Thread)
From: baudehlo
2007-07-04 03:02 am (UTC)

You're overblowing it

Thundering herd is an issue when you have a big herd (e.g. in Apache's 500 children situation).

How many children are you talking about here? In Qpsmtpd I basically launch one per CPU, and while yes, 4 processes get notified about readability, it's really not a big deal.

I did try the send_fd/read_fd method, and the pipe to children method, but neither really helped any as far as I could see.

Perhaps it's time to do some real world testing to see if it's really an issue for your app.
(Reply) (Thread)
[User Picture]From: scosol
2007-07-04 04:34 am (UTC)
Feels kinda lame that I have to do this.

Any better suggestions?

yes- stop using threads and start using coroutines.
(Reply) (Thread)
[User Picture]From: brad
2007-07-04 09:22 am (UTC)

Welcome to the future where everybody has like 8 cores at least.
(Reply) (Parent) (Thread) (Expand)
[User Picture]From: dormando
2007-07-04 05:50 am (UTC)

The thundering herd problem keeps coming up every few years.

Fun recent thread. Kevents! Event lossage! Blah Blah blah.

OS should just pick a thread. Syscall shouldn't dequeue an event until it's been totally served. But they don't?

Otherwise, almost everything you try to do to get around it is the same amount of work anyway. Some sort of listener/queue relationship.
(Reply) (Thread)
[User Picture]From: ydna
2007-07-04 06:02 am (UTC)
LJ via iPhone. *giggles*

Hey! I'm drunk. That's a valid excuse. Fortunately, it (the phone, yo) fixes most of my typing mistakes.


I've got yer thundering hurd right here, bitch.
(Reply) (Thread)
[User Picture]From: edm
2007-07-04 07:19 am (UTC)

Thundering herd

Are you sure you actually have a problem here? Eg, do you have actual strace()s of multiple child processes being woken up from accept() when a single incoming connection comes in? Or are you merely speculating that if this happened it would be bad?

I ask because one of the early "competitive benchmarks" of Linux/Apache and Windows/IIS uncovered this problem, around 1999 from what I can tell with a bit of googling, and a fix went into Linux 2.3 IIRC. (The fix being: pick one of the processes to wake up, and skip the rest.) So any vaguely modern Linux kernel shouldn't do that.



(the latter being a 2005 book which reports what the fix was -- search for "thundering herd" in the PDF -- but I'm pretty sure it'd been in the Linux kernel for years by the time that book came out)

IIRC *BSD followed suit with a similar fix, and Solaris and some of the other high performance unixes had a similar fix even before that.

So I'd suggest confirming you actually do have a problem before getting too carried away trying to fix it in userland...

(Reply) (Thread)
[User Picture]From: fanf
2007-07-04 07:23 am (UTC)

Re: Thundering herd

I also thought this problem had been solved in the kernel.
(Reply) (Parent) (Thread) (Expand)
From: urykhy
2007-07-04 07:41 am (UTC)
apache use accept serialization via mutex

so only one process/thread blocks in accept call
others blocks in mutex

(Reply) (Thread)
From: jmason
2007-07-04 10:44 am (UTC)

SA preforking

We actually moved to the model you described in SpamAssassin 3.1.0.

Previously we did the "preforked pool of servers all doing blocking accept" thing, but that didn't allow scaling of the pool size to deal with demand. So instead, I sat down with some of the Apache httpd guys, went over how Apache 2 preforking and pool scaling works, and came up with a pure-perl implementation. That's what we now use. Here's the key bits:

  • Parent shares the accept fd with kids, and socketpairs between parent and each child, as you describe.
  • Parent process maintains pool size to always include N idle children, scales up/down children as the number of idle kids increases/decreases with load (Apache-style preforking).
  • Parent selects on the accept socket.
  • When a new incoming connection appears, it picks the lowest-numbered idle child and orders it to accept.
  • The children report back state ("idle", "busy") over the socketpair as they become idle, or are about to accept a connection.
  • The child waits for orders from the parent. If the parent orders a child to accept, it reports itself as "busy", accepts the conn, deals with it, then reports itself as "idle".

Note, btw, the use of lowest-numbered idle child; that's an easy way to keep the same kid processes "hot". Apache httpd does the same thing (iirc). Since the communication is therefore generally between processes that are swapped in, and no swapping is required, this was a key benefit that makes this a little faster than the traditional "preforked blocking accept" style, at least for most casual users. (Of course a well monitored setup where the admin is careful to ensure swap is never hit would probably be more efficient using the traditional "blocking accept" model, so we still offer that; but most people aren't that careful.)

We had a nasty bug that went on for a while on some loaded servers, but eventually we got it fixed (deleting an entry from a hash in a SIGCHLD signal handler is unsafe, doh!). Nowadays it seems to be quite stable.

The code's in our SVN: lib/Mail/SpamAssassin/SpamdForkScaling.pm and lib/Mail/SpamAssassin/SubProcBackChannel.pm .

(Reply) (Thread)
[User Picture]From: taral
2007-07-04 02:54 pm (UTC)
Out of curiosity, are you using EPOLLET?
(Reply) (Thread)
[User Picture]From: davidphillips
2007-07-08 02:00 am (UTC)
Don't worry about it? Assuming one child per CPU, the herd is small.

Only idle children will wake up. Those doing work will miss the event as it will be handled before they enter the event loop.

An idle child on a non-idle CPU might not wake up for the same reason.

You are adding complexity to optimize an idle system at the expense of a non-idle system.
(Reply) (Thread)
[User Picture]From: brad
2007-07-08 06:49 pm (UTC)
The herd (the number of CPUs in a typical box) is currently small, but for how long? Already 8 CPU boxes are typical in our data center. And next year? I'd rather just do it right once and not worry about it in a few years.
(Reply) (Parent) (Thread)