
thundering herd; bleh [Jul. 3rd, 2007|06:06 pm]
Brad Fitzpatrick

Imagine the following scenario:
  • Parent process opens the listening socket, but will never accept() on it. It's opened in the parent so that future forked child processes inherit the fd, and because it may be a low port that we want to open before we drop root.
  • Child processes (one per CPU) inherit listening fd, but they're event-based, handling many connections each, so they can't do a blocking accept() on it...
See where this is going? :-)
  • So if each child has readability of that listening socket fd in its select/epoll/queue set, guess what? Thundering herd. The kernel wakes up all processes, but only one accepts the new client connection.
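The scenario above can be reproduced in a small sketch, with threads standing in for the forked children (the kernel-side wakeup behavior is the same): every "child" has the listening fd in its select set, one incoming connection wakes all of them, but only one accept() succeeds.

```python
import select, socket, threading, time

# Threads stand in for forked child processes in this sketch: all of them
# watch the same listening fd for readability, all wake on one connection,
# and only one accept() wins.
lsock = socket.socket()
lsock.bind(("127.0.0.1", 0))
lsock.listen(8)
lsock.setblocking(False)

results = []
lock = threading.Lock()

def child():
    r, _, _ = select.select([lsock], [], [], 2.0)   # timeout as a safety net
    if not r:
        return
    try:
        conn, _ = lsock.accept()       # one winner gets the connection...
        conn.close()
        outcome = "accepted"
    except BlockingIOError:            # ...the rest of the herd woke for nothing
        outcome = "woke for nothing"
    with lock:
        results.append(outcome)

children = [threading.Thread(target=child) for _ in range(4)]
for t in children:
    t.start()
time.sleep(0.2)                        # let every child block in select()
socket.create_connection(lsock.getsockname()).close()
for t in children:
    t.join()
print("accepted:", results.count("accepted"))
```

Exactly one child ends up with the connection no matter how many are woken.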
So instead I guess I have to...
  • have a socketpair between parent and each child for communication,
  • watch readability of the listening fd in parent process,
  • on readability, send a message to "best" child (for some definition of best) to accept on the listening socket which it has, but isn't watching readability on.
  • perhaps have children tell the parent process when connections go away, so parent knows approximate load on each child.
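The steps above can be sketched as follows (POSIX-only, since it uses os.fork). The one-byte protocol is made up for illustration: the parent sends b"A" ("accept now") over the socketpair and the child replies b"D" ("done"); "best child" is simplified to round-robin, and load tracking is left out.

```python
import os, select, socket

lsock = socket.socket()
lsock.bind(("127.0.0.1", 0))
lsock.listen(16)
addr = lsock.getsockname()

children = []                          # (pid, parent-side end of socketpair)
for _ in range(2):
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:                       # child: event loop on the control fd only
        parent_end.close()
        while True:
            cmd = child_end.recv(1)
            if cmd != b"A":            # EOF: parent is gone, shut down
                os._exit(0)
            conn, _ = lsock.accept()   # accept only when ordered to
            conn.sendall(b"hello from child %d" % os.getpid())
            conn.close()
            child_end.send(b"D")       # report back: idle again
    child_end.close()
    children.append((pid, parent_end))

replies = []
for i in range(4):
    client = socket.create_connection(addr)
    select.select([lsock], [], [])     # the parent alone watches readability
    _, pipe = children[i % len(children)]
    pipe.send(b"A")                    # order that child to accept
    replies.append(client.recv(100))
    pipe.recv(1)                       # wait for the b"D" ack
    client.close()

for _, pipe in children:               # close all pipes first so every child
    pipe.close()                       # sees EOF, then reap them
for pid, _ in children:
    os.waitpid(pid, 0)
print(len(replies))
```

Only the parent ever selects on the listening fd, so no herd; the accept() itself still runs in the children.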
Feels kinda lame that I have to do this.

Any better suggestions?

Comments:
From: taral
2007-07-04 01:48 am (UTC)
Message queues?
From: brad
2007-07-04 02:14 am (UTC)
Didn't Linux get them only recently? That means I'd have to implement something else for portability anyway, so it's not quite worth it.
From: taral
2007-07-04 02:51 am (UTC)
Message queues? No, they've been around for a while (2.6.6) and are very portable.
From: grumpy_sysadmin
2007-07-04 07:25 am (UTC)
Or, you know, since POSIX defined them. Back before the dawn of time. Just because Linux is late to the game...
From: grumpy_sysadmin
2007-07-04 07:26 am (UTC)
Oh oh. I thought you WERE describing SysV message queues. Not some heretical other construct. Please to disregard.
From: taral
2007-07-04 02:52 am (UTC)
And if you're using 2.4, you can use SYSV message queues, which work just as well (if not better, since they don't require that stupid mqueue filesystem).
From: taral
2007-07-04 01:54 am (UTC)
There's no simple fix to this problem. The correct solution is a single-producer (epoll waiter)/multiple-consumer model with a single queue for transmission of work to the workers.
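The model taral describes can be sketched with a queue.Queue: the single producer is the only accept() caller, and the queue's own blocking get() is what wakes exactly one worker per connection.

```python
import queue, socket, threading

lsock = socket.socket()
lsock.bind(("127.0.0.1", 0))
lsock.listen(8)

work = queue.Queue()
handled = []
lock = threading.Lock()

def consumer():
    while True:
        conn = work.get()              # wakes exactly one waiting consumer
        if conn is None:
            return                     # poison pill: shut down
        data = conn.recv(16)
        conn.close()
        with lock:
            handled.append(data)

workers = [threading.Thread(target=consumer) for _ in range(4)]
for t in workers:
    t.start()

# Single producer: the only place accept() ever happens.
for i in range(8):
    client = socket.create_connection(lsock.getsockname())
    client.sendall(b"req%d" % i)
    conn, _ = lsock.accept()
    work.put(conn)                     # hand the connection to the pool
    client.close()

for _ in workers:
    work.put(None)
for t in workers:
    t.join()
print(len(handled))
```

No herd, and no "best child" bookkeeping: work is picked up by whichever worker is free first.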
From: brad
2007-07-04 02:11 am (UTC)
Isn't that what I described?
From: taral
2007-07-04 02:50 am (UTC)
Not quite. Since the workers are blocking on the queue, only one of them will be woken up when work is available, and work is guaranteed to be picked up by the first available worker. You don't have to do anything about "best".
From: valiskeogh
2007-07-04 02:43 am (UTC)
When my child isn't listening to what I (the parent) am saying, I always find the best thing to do is throw something at him, usually a rolled-up sock (socket in French, I guess), or if that's not available, a toy truck.

I only have one, so he's always the "best" child.
From: crucially
2007-07-04 02:09 am (UTC)
Accept in the parent and pass the fd over to the child that should have it?

Something like http://search.cpan.org/~addi/File-FDpasser-0.09/ does it.

From: brad
2007-07-04 02:13 am (UTC)
As long as I have a socket to the child anyway, I don't gain anything that I can see from having the child do the accept.

Either way I have to:

1) wait for readability
2) notify child
3) accept

It might even be better to have the child accept, as that's
spread out between multiple CPUs.

With fd passing, I'm just introducing portability and dependency problems.

From: nothings
2007-07-04 02:51 am (UTC)
As long as I have a socket to the child anyway, I don't gain anything that I can see from having the child do the accept.

Assuming '[no advantage from] having the child accept' is a typo for '[no advantage from] having the parent accept', from context...

I know NOTHING about this in practice (I once wrote code that called select()... in 1990), but: if you accept in the child, how does the parent know when the _next_ connection is available? It doesn't seem like it can if it's just 'watching for readability', unless there's actually something subtler in the how it's getting info about incoming connections. If not, you can't actually spread the accept load across the children because you can't get two accepts running 'at once'. If all this is true, given two incoming connections, it seems like the fastest way to get the second one accepted would be accepting in the parent.
From: crucially
2007-07-04 03:39 am (UTC)
I am pretty sure there is a syscall to figure out how many pending connections there are, but I am too lazy to look it up in my Stevens right now.
From: nothings
2007-07-04 04:06 am (UTC)
Still not good enough! The problem is you still have no way of knowing whether the N connections waiting to be accepted are new or have already been messaged-off to a child.

Imagine you check and you have 1 waiting, so you send a message to a child: "hey, accept this". Then you check and see 0. Hooray. Then you check and see 1; OK, that's new. That case works.

Now imagine you check and you have 1 waiting, so you send a message to a child: "hey, accept this". Then you check and see 1. OK, I guess the child hasn't gotten to it yet. Then, between the times you check, the child accepts, and another connection comes in. Next time you check, it's still at 1.
From: crucially
2007-07-04 04:15 am (UTC)
Good point; you would need an ack back from the child that it actually accepted it.
From: crucially
2007-07-04 04:18 am (UTC)
Use a locking algorithm between the children so only one ever listens; a shared mutex, for example.

Or just have the master control which child is currently assigned to be the acceptor, and if that child goes away or gets too loaded, move the role to another child.

When a child gets to be the acceptor, it starts watching readability and accepting?
From: ext_53623
2007-07-04 03:01 am (UTC)

trim the herd

How about not having every child put that listening socket fd in its select/epoll/queue set?

Instead have a token (or a few of them) that get passed among the child processes; if a child has the token then they put the fd in their select/epoll/etc., otherwise they don't. Once they accept a new connect they pass the token to the "next" child process. That at least lets you control the size of the herd, which will be equal to the number of tokens in circulation.
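The token idea can be sketched with a Queue holding k tokens: only current token holders put the listening fd in their select set, so the herd is capped at k. (Queue.get hands a token to an arbitrary waiter, so "next" here is approximate.)

```python
import queue, select, socket, threading, time

TOKENS, WORKERS, CONNS = 2, 4, 5       # herd size is capped at TOKENS

lsock = socket.socket()
lsock.bind(("127.0.0.1", 0))
lsock.listen(8)
lsock.setblocking(False)

tokens = queue.Queue()
for _ in range(TOKENS):
    tokens.put("token")

accepted = []
lock = threading.Lock()
stop = threading.Event()

def child():
    while not stop.is_set():
        token = tokens.get()                   # only token holders watch the fd
        r, _, _ = select.select([lsock], [], [], 0.05)
        if r:
            try:
                conn, _ = lsock.accept()
                conn.close()
                with lock:
                    accepted.append(1)
            except BlockingIOError:
                pass                           # another token holder won the race
        tokens.put(token)                      # pass the token along

threads = [threading.Thread(target=child) for _ in range(WORKERS)]
for t in threads:
    t.start()
for _ in range(CONNS):
    socket.create_connection(lsock.getsockname()).close()

deadline = time.time() + 5                     # wait for the pool to drain
while len(accepted) < CONNS and time.time() < deadline:
    time.sleep(0.01)
stop.set()
for t in threads:
    t.join()
print(len(accepted))
```

With TOKENS = 1 this degenerates into a single rotating acceptor; larger values trade a small herd for faster pickup.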

From: baudehlo
2007-07-04 03:02 am (UTC)

You're overblowing it

Thundering herd is an issue when you have a big herd (e.g. in Apache's 500 children situation).

How many children are you talking about here? In Qpsmtpd I basically launch one per CPU, and while yes, 4 processes get notified about readability, it's really not a big deal.

I did try the send_fd/read_fd method, and the pipe to children method, but neither really helped any as far as I could see.

Perhaps it's time to do some real world testing to see if it's really an issue for your app.
(Deleted comment)
From: scosol
2007-07-04 04:34 am (UTC)
Feels kinda lame that I have to do this.

Any better suggestions?


Yes: stop using threads and start using coroutines.
From: brad
2007-07-04 09:22 am (UTC)
*sigh*

Welcome to the future where everybody has like 8 cores at least.
From: scosol
2007-07-06 11:44 pm (UTC)
In my future the VM handles CPU spread, and exec-ready status is abstracted out through another event layer. Life is too short to worry about such things :)
http://www.iolanguage.com/

To thin the herd, a logical mutex has to happen somewhere, and by definition it must block; once you do that, you're no longer event-based.

Can't have it both ways :)
From: dormando
2007-07-04 05:50 am (UTC)
bah!

The thundering herd problem keeps coming up every few years.

Fun recent thread. Kevents! Event lossage! Blah Blah blah.

The OS should just pick one thread. The syscall shouldn't dequeue an event until it's been totally served. But they don't?

Otherwise, almost everything you try to do to get around it is the same amount of work anyway. Some sort of listener/queue relationship.
From: ydna
2007-07-04 06:02 am (UTC)
LJ via iPhone. *giggles*

Hey! I'm drunk. That's a valid excuse. Fortunately, it (the phone, yo) fixes most of my typing mistakes.

:p

I've got yer thundering hurd right here, bitch.
From: edm
2007-07-04 07:19 am (UTC)

Thundering herd

Are you sure you actually have a problem here? Eg, do you have actual strace()s of multiple child processes being woken up from accept() when a single incoming connection comes in? Or are you merely speculating that if this happened it would be bad?

I ask because one of the early "competitive benchmarks" of Linux/Apache and Windows/IIS uncovered this problem, around 1999 from what I can tell with a bit of googling, and a fix went into Linux 2.3 IIRC. (The fix being: pick one of the processes to wake up, and skip the rest.) So any vaguely modern Linux kernel shouldn't do that.

Eg,

http://lwn.net/1999/0513/kernel.php3
http://lwn.net/images/pdf/LDD3/ch06.pdf

(the latter being a 2005 book which reports what the fix was -- search for "thundering herd" in the PDF -- but I'm pretty sure it'd been in the Linux kernel for years by the time that book came out)

IIRC *BSD followed suit with a similar fix, and Solaris and some of the other high performance unixes had a similar fix even before that.

So I'd suggest confirming you actually do have a problem before getting too carried away trying to fix it in userland...

Ewen
From: fanf
2007-07-04 07:23 am (UTC)

Re: Thundering herd

I also thought this problem had been solved in the kernel.
From: brad
2007-07-04 09:23 am (UTC)

Re: Thundering herd

I observed it. If you're doing a blocking accept, the kernel only wakes up one process. If you're just waiting for readability, the kernel wakes up all.
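The difference is easy to reproduce. In this sketch threads stand in for processes (the kernel-side wait queues behave the same way): on a modern kernel, N callers blocked in accept() are woken one at a time, while N select() waiters on the same listening fd all see it become readable.

```python
import select, socket, threading, time

def listener():
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    s.listen(8)
    return s

lock = threading.Lock()

# Case 1: blocking accept(). The kernel hands the connection to one waiter.
ls1 = listener()
woken = []
def accept_waiter():
    conn, _ = ls1.accept()
    with lock:
        woken.append(conn)
for _ in range(3):
    threading.Thread(target=accept_waiter, daemon=True).start()
time.sleep(0.2)                        # let all three block in accept()
socket.create_connection(ls1.getsockname()).close()
time.sleep(0.5)
accept_wakeups = len(woken)            # the other two are still asleep

# Case 2: select() readability. Every waiter sees the fd become readable.
ls2 = listener()
readable = []
def select_waiter():
    select.select([ls2], [], [])
    with lock:
        readable.append(1)
waiters = [threading.Thread(target=select_waiter) for _ in range(3)]
for t in waiters:
    t.start()
time.sleep(0.2)
socket.create_connection(ls2.getsockname()).close()
for t in waiters:
    t.join()
print(accept_wakeups, len(readable))
```

One connection, one wakeup in the blocking-accept case; one connection, every waiter woken in the readability case.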
From: fanf
2007-07-04 10:17 am (UTC)

Re: Thundering herd

Ah I guess they optimised for Apache rather than for the threaded case.

If you have children handling multiple sockets, are they not likely to have other activity to deal with and so they will get woken up anyway? Yes, you'll have several wasted accepts, but presumably you'll have only a few processes (one per core) not hundreds (one per socket) as in the original thundering herd scenario. Just wondering if it's really a serious problem.
From: edm
2007-07-04 07:28 am (UTC)

Re: Thundering herd

On rereading your post more carefully, I see your problem is a bit more specialised than just sitting in accept(), due to the event handling architecture. But I'd be surprised if you were the first to encounter this problem with that architecture, so there may already be a fix included, particularly if you're using something newer than select() to wait on events.

The other idea that immediately comes to mind is that you could use a kernel thread (in each separate process) to turn the problem into the solved sitting-in-accept() form. Ie, spawn a kernel thread that sits in accept(), and generates events to the processing thread of that process saying "new connection". Providing the competing accept()s get some level of load balancing in the kernel you should get your desired load balancing.

Ewen
From: urykhy
2007-07-04 07:41 am (UTC)
Apache uses accept serialization via a mutex, so only one process/thread blocks in the accept call; the others block on the mutex.
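What urykhy describes can be sketched with a threading.Lock standing in for Apache's cross-process accept mutex: whoever holds the lock is the only caller blocked in accept(); everyone else queues on the mutex, so one connection wakes exactly one worker.

```python
import socket, threading

lsock = socket.socket()
lsock.bind(("127.0.0.1", 0))
lsock.listen(8)
addr = lsock.getsockname()

clients = [socket.create_connection(addr) for _ in range(6)]  # queue up work
accept_mutex = threading.Lock()
remaining = [len(clients)]             # pending-connection count, under the lock

def worker():
    while True:
        with accept_mutex:             # serialize: one accept()er at a time
            if remaining[0] == 0:
                return
            conn, _ = lsock.accept()
            remaining[0] -= 1
        conn.sendall(b"ok")            # handle the connection outside the mutex
        conn.close()

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
replies = [c.recv(2) for c in clients]
for c in clients:
    c.close()
print(replies.count(b"ok"))
```

In real Apache the mutex is cross-process (fcntl, SysV semaphore, etc. depending on platform), but the shape is the same.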

From: jmason
2007-07-04 10:44 am (UTC)

SA preforking

We actually moved to the model you described in SpamAssassin 3.1.0.

Previously we did the "preforked pool of servers all doing blocking accept" thing, but that didn't allow scaling of the pool size to deal with demand. So instead, I sat down with some of the Apache httpd guys, went over how Apache 2 preforking and pool scaling works, and came up with a pure-perl implementation. That's what we now use. Here's the key bits:

  • Parent shares the accept fd with kids, and socketpairs between parent and each child, as you describe.
  • Parent process maintains pool size to always include N idle children, scales up/down children as the number of idle kids increases/decreases with load (Apache-style preforking).
  • Parent selects on the accept socket.
  • When a new incoming connection appears, it picks the lowest-numbered idle child and orders it to accept.
  • The children report back state ("idle", "busy") over the socketpair as they become idle, or are about to accept a connection.
  • The child waits for orders from the parent. If the parent orders a child to accept, it reports itself as "busy", accepts the conn, deals with it, then reports itself as "idle".
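The parent-side bookkeeping in the steps above can be sketched as below. The class and method names are made up for illustration (the real code is in SpamdForkScaling.pm): children report "idle"/"busy" over their socketpair, and the parent dispatches to the lowest-numbered idle child.

```python
class PreforkScaler:
    def __init__(self, n_children):
        self.state = {i: "idle" for i in range(n_children)}

    def report(self, child, state):
        """A child reported its state over the backchannel."""
        self.state[child] = state

    def pick_child(self):
        """Lowest-numbered idle child, or None if all are busy."""
        idle = [c for c, s in sorted(self.state.items()) if s == "idle"]
        return idle[0] if idle else None

pool = PreforkScaler(4)
pool.report(0, "busy")
pool.report(1, "busy")
print(pool.pick_child())               # lowest-numbered idle kid: 2
pool.report(0, "idle")
print(pool.pick_child())               # child 0 is "hot" again: 0
```

Always picking the lowest-numbered idle child is what keeps the same few processes hot (and swapped in), as the next paragraph explains.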

Note, btw, the use of lowest-numbered idle child; that's an easy way to keep the same kid processes "hot". Apache httpd does the same thing (iirc). Since the communication is therefore generally between processes that are swapped in, and no swapping is required, this was a key benefit that makes this a little faster than the traditional "preforked blocking accept" style, at least for most casual users. (Of course a well monitored setup where the admin is careful to ensure swap is never hit would probably be more efficient using the traditional "blocking accept" model, so we still offer that; but most people aren't that careful.)

We had a nasty bug that went on for a while on some loaded servers, but eventually we got it fixed (deleting an entry from a hash in a SIGCHLD signal handler is unsafe, doh!). Nowadays it seems to be quite stable.
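The bug class mentioned here, doing real work like mutating a shared hash inside a SIGCHLD handler, has a standard fix: make the handler trivially set a flag and do the reaping in the main loop. A minimal sketch (in Python rather than the Perl in question; POSIX-only, since it uses os.fork):

```python
import os, signal, time

children = {}                     # pid -> state; touched only by the main loop
got_sigchld = False

def on_sigchld(signum, frame):
    global got_sigchld
    got_sigchld = True            # nothing else happens inside the handler

signal.signal(signal.SIGCHLD, on_sigchld)

pid = os.fork()
if pid == 0:
    os._exit(0)                   # child exits immediately for the demo
children[pid] = "running"

while children:                   # main loop: the safe place for real work
    if got_sigchld:
        got_sigchld = False
        done, _status = os.waitpid(-1, os.WNOHANG)
        if done:
            del children[done]    # safe: we're not inside the handler
    time.sleep(0.01)              # a real server would select() here instead
print(len(children))
```

The handler stays async-signal-safe; all the bookkeeping happens at a well-defined point in the event loop.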

The code's in our SVN: lib/Mail/SpamAssassin/SpamdForkScaling.pm and lib/Mail/SpamAssassin/SubProcBackChannel.pm .

From: taral
2007-07-04 02:54 pm (UTC)
Out of curiosity, are you using EPOLLET?
From: davidphillips
2007-07-08 02:00 am (UTC)
Don't worry about it? Assuming one child per CPU, the herd is small.

Only idle children will wake up. Those doing work will miss the event as it will be handled before they enter the event loop.

An idle child on a non-idle CPU might not wake up for the same reason.

You are adding complexity to optimize an idle system at the expense of a non-idle system.
From: brad
2007-07-08 06:49 pm (UTC)
The herd (the number of CPUs in a typical box) is currently small, but for how long? Already 8 CPU boxes are typical in our data center. And next year? I'd rather just do it right once and not worry about it in a few years.