June 5th, 2006


readahead / blocking sendfile

I've had a known inefficiency in Perlbal for ages now and finally broke down and fixed it. The inefficiency is that sendfile can block even if the destination fd is a non-blocking socket, because the source fd (a disk-based file) can force a disk read if the data isn't already in the pagecache.

FreeBSD has a fancy sendfile that lets you request it not block, but Linux doesn't.

The solution on Linux is to do a readahead() call first in another thread, or just sendfile() in another thread, either of which IO::AIO can do. I wanted to test the theory changing as little code as possible, so I went with the async readahead.
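To make the idea concrete, here's a rough sketch of the pattern in Python rather than Perl. It is not Perlbal's code and doesn't use IO::AIO. The names warm_pagecache and send_when_cached are made up for illustration, and because Python's standard library has no readahead() wrapper, a plain blocking read in a worker thread stands in for it (an assumption, with the same goal of pulling the pages into the pagecache off the event loop). It also does a single sendfile call; a real server would loop and handle a full socket buffer.

```python
import asyncio
import os

def warm_pagecache(fd, offset, length, chunk=1 << 20):
    """Blocking read of [offset, offset + length) so the pages end up in pagecache.

    Stand-in for readahead(2) (assumption); meant to run in a worker thread,
    so any disk wait happens there and not in the event loop.
    """
    end = offset + length
    while offset < end:
        data = os.pread(fd, min(chunk, end - offset), offset)
        if not data:  # hit end of file early
            break
        offset += len(data)

async def send_when_cached(sock, fd, offset, length):
    """Warm the cache off-thread, then sendfile from the event-loop thread."""
    loop = asyncio.get_running_loop()
    # The disk read (if any) happens in the default thread pool.
    await loop.run_in_executor(None, warm_pagecache, fd, offset, length)
    # Back on the event-loop thread: the data should now be cached, so
    # sendfile is unlikely to stall on disk. Returns the bytes sent.
    return os.sendfile(sock.fileno(), fd, offset, length)
```

The point of the split is that the only call left on the event-loop thread is the sendfile itself, made once the source data is expected to be in memory.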

Before I did that, though, I wrote a test case.

The test case runs two processes in parallel. The first fetches 3 small hot files over and over again, measuring the mean speed of 100 requests. The second is there just to mess with the first one and doesn't output anything: it either fetches the same 3 small files or, with the "big" parameter, fetches seven 100 MB files in a loop, more than this Xen instance's 512 MB of memory. The idea is to see if the disk reads serving the big files stall the event loop and increase turn-around time.
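The measuring half of a test like this could look roughly like the following Python sketch. parallel.pl itself isn't shown here, so this is a guess at its shape: time_requests is a hypothetical name, fetch is whatever callable performs one request, and timing each request individually and using the population standard deviation are both assumptions.

```python
import statistics
import time

def time_requests(fetch, n=100):
    """Call fetch() n times and return (mean, stddev) of the per-call times in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    # Population standard deviation; the original script may compute it differently.
    return statistics.mean(samples), statistics.pstdev(samples)
```

The background process would simply run its own fetch loop without timing or printing anything.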


lj@LJ_web:~$ ./parallel.pl small; ./parallel.pl small; ./parallel.pl small; ./parallel.pl big;./parallel.pl big;./parallel.pl big;
mean: 0.287987213134766, stddev: 0.0829109309255669
mean: 0.279777903556824, stddev: 0.0957734761804354
mean: 0.238886480331421, stddev: 0.0949280425469577

mean: 0.351436612606049, stddev: 0.0791952577383974
mean: 0.361295075416565, stddev: 0.0863025646086743
mean: 0.3904807305336, stddev: 0.173639453608837

The first set of three lines is the time to serve small files with other small files being served in the background. The second set of three is the time to serve small files with big files being served in the background.

After adding the async readahead() call and doing the sendfile in the callback (once the data to be sendfile'd is in the pagecache), the results even out a bunch:

lj@LJ_web:~$ ./parallel.pl small; ./parallel.pl small; ./parallel.pl small; ./parallel.pl big;./parallel.pl big;./parallel.pl big;
mean: 0.296060967445374, stddev: 0.0586433388625736
mean: 0.262518639564514, stddev: 0.0726501212827927
mean: 0.285000162124634, stddev: 0.0321991597111094

mean: 0.302280473709106, stddev: 0.0811435349061447
mean: 0.303003549575806, stddev: 0.0787071540895621
mean: 0.298841729164124, stddev: 0.0953137343692458

Probably some more work to be done, but promising.