brad's life
Brad Fitzpatrick

splice() gets network receive [Jul. 17th, 2007|05:59 pm]

As a follow-up to my earlier excitement, more fun developments in splice() syscall land:

http://www.ussg.iu.edu/hypermail/linux/kernel/0707.1/1426.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0707.1/1427.html

Network receive support!

Getting closer.

Pretty soon Perlbal can do zero-copy network receive/send, without any copies to/from userspace.

Linux on Delta [Feb. 18th, 2007|11:01 am]

The in-seat entertainment on Delta (even in economy: impressive) was really good. Made my first flight from SFO to JFK quite bearable. Listened to music, watched a movie, etc.

And I noticed when it booted:



Heh. Apparently it runs Linux. You'd think they'd do some Delta-themed bootsplash.

"Generic AIO by scheduling stacks" [Jan. 30th, 2007|02:55 pm]

Zach Brown just posted to lkml (a few minutes ago) ...

[PATCH 0 of 4] Generic AIO by scheduling stacks

It's a syscall to submit syscalls to run async. Then another syscall to async gather the results of the submitted syscalls as they complete. One of the most wonderful things I've seen in a while! Any syscall!

And I'm especially happy that Linus loves it, so we should expect to see it sooner rather than later in real kernels.

Yay!

RAID-5 misc [Jan. 27th, 2007|03:11 pm]

I never use RAID-5, so I'd never noticed this before:
   -f, --force
        Insist  that  mdadm  accept  the  geometry and layout
        specified without question.  Normally mdadm will  not
        allow  creation of an array with only one device, and
        will try to create a raid5  array  with  one  missing
        drive (as this makes the initial resync work faster).
        With --force, mdadm will not try to be so clever.

And indeed, when I created the array with 5 disks, it marked one as a spare:
# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sat Jan 27 13:30:36 2007
     Raid Level : raid5
     Array Size : 1953545984 (1863.05 GiB 2000.43 GB)
    Device Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Jan 27 13:52:08 2007
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 27% complete

           UUID : 5ad3ba82:30b256f3:c70f55c8:1f40abbd
         Events : 0.194

    Number   Major   Minor   RaidDevice State
       0       8       48        0      active sync   /dev/sdd
       1       8       64        1      active sync   /dev/sde
       2       8       80        2      active sync   /dev/sdf
       3       8       96        3      active sync   /dev/sdg
       5       8      112        4      spare rebuilding   /dev/sdh

And you can see that 4 disks are reading, and 1 is writing:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
...
sdd             198.00     45824.00         0.00      45824          0
sde             198.00     45824.00         0.00      45824          0
sdf             200.00     45824.00         0.00      45824          0
sdg             203.00     46080.00         0.00      46080          0
sdh             203.00         0.00     46080.00          0      46080
...

Neat!

It makes sense why it's done this way: 4 disks doing sequential reads and 1 doing sequential writes is faster than 5 disks doing mixed reads and writes.

But really? 6 hours?

I'd prefer an option where all disks are zeroed, and then the initial resync is skipped. Yes, the array wouldn't be immediately usable like it is with the kernel doing the background sync for me, but I think I could zero a 460 GB disk quicker than 6 hours. (Based on the 100 MB/s filesystem writes I saw, and assuming I can do even more to a raw block device, it should be about an hour.) But I can't see how... --assume-clean may be what I'm looking for? Do I just zero all the devices myself first, then re-create the array?

I wouldn't normally mind, but I want to performance-test several configurations and 6-hour waits seriously kill my flow. :)
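
The numbers above hang together if you run them. A quick sketch, assuming the iostat "blocks" are 512-byte sectors and that the 100 MB/s figure would hold as a sustained raw-device write rate (both are assumptions, and the helper names are made up for illustration):

```python
KIB = 1024


def raid5_usable_kib(device_kib, n_devices):
    """RAID-5 spends one device's worth of space on parity."""
    return device_kib * (n_devices - 1)


def hours_to_cover(device_bytes, bytes_per_second):
    """Time to read or write a whole device at a constant rate."""
    return device_bytes / bytes_per_second / 3600


device_kib = 488386496                 # "Device Size" from mdadm --detail
device_bytes = device_kib * KIB        # about 500.11 GB

usable_kib = raid5_usable_kib(device_kib, 5)   # matches "Array Size"

# Rebuild rate seen in iostat: 46080 blocks/s written to sdh,
# ASSUMING 512-byte blocks.
resync_hours = hours_to_cover(device_bytes, 46080 * 512)

# Zeroing at an ASSUMED sustained 100 MB/s.
zero_hours = hours_to_cover(device_bytes, 100e6)

print(usable_kib, round(resync_hours, 1), round(zero_hours, 1))
```

Under those assumptions the resync comes out just under 6 hours and a full zeroing pass at a bit under an hour and a half per disk, so zeroing first would still be several times faster if the disks can be written in parallel.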

HDR and Linux [Nov. 18th, 2006|08:24 pm]

Anybody here do any HDR work with Linux?

I think I'm finally wrapping my head around all the concepts, formats, tools, processes, etc, but I've yet to do anything with any of it. Although I have some source images I'm eager to play with.

If my understanding is correct, there are basically two phases:

1) get multiple source images (on a tripod) of different exposures (shutter speed differences, not aperture) and run them through a tool to convert them into an HDR file (floating point pixel values, not 8-bit per channel). These file formats are either *.hdr or the ILM OpenEXR format? Or maybe some other formats. The tool mkhdr looks like it can do this, with some hand-holding. (you have to give it ppm files and shutter speeds on command line, since it can't read any raw files, but that's understandable because there's a dozen+ raw formats...)

2) given an HDR file of some format (depending on tool), do cool shit with it. Canonical examples are various blurs that used to clip out highs, and "tone mapping", for which there seem to be various algorithms that reduce the HDR data down into something sexy for the screen (which is low dynamic range)
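
To make the two phases concrete, here is a toy sketch for a single pixel. It assumes the pixel values are already linear in light and scaled to 0..1 (real tools also have to recover the camera's response curve), and it uses the simple global Reinhard operator L/(1+L) as just one example of tone mapping, not necessarily the one behind the typical online HDR look. The function names are made up; this is not how mkhdr works internally.

```python
def merge_exposures(pixels, shutter_seconds):
    """Estimate relative radiance for one pixel from several linear exposures.

    pixels: values in 0..1, one per exposure
    shutter_seconds: matching exposure times
    """
    num = den = 0.0
    for value, t in zip(pixels, shutter_seconds):
        # "Hat" weight: trust mid-range samples, ignore clipped or black ones.
        w = 1.0 - abs(2.0 * value - 1.0)
        num += w * value / t
        den += w
    return num / den if den else 0.0


def reinhard(radiance):
    """Global Reinhard tone mapping: compress [0, inf) into [0, 1)."""
    return radiance / (1.0 + radiance)


# One pixel shot at 1/8 s, 1/4 s and 1 s; the 1 s frame is clipped at 1.0.
radiance = merge_exposures([0.25, 0.5, 1.0], [0.125, 0.25, 1.0])
display = reinhard(radiance)
```

The clipped frame gets zero weight, so the radiance estimate comes only from the two usable exposures, which is the whole point of bracketing.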

So I guess my questions are:

-- best Linux tool for creating HDR images? is mkhdr good? I can automate the parameter hell.
-- best tools for converting between *.hdr and OpenEXR, etc? The imagemagick of HDR file formats?
-- best tone mapping algorithms to create the typical HDR photos you see online

Any pointers appreciated. Thanks!

djabberd: c10k? hah! [Jun. 26th, 2006|10:09 pm]

DJabberd just did 25,200 (fully set up) connections with 97 MB of RAM before my Xen instance ran out of memory. It's now 3.4kB of overhead per connection (contrast to 30kB this morning) but there are still obvious ways to trim it down. Should be able to get it down to 2kB. The big win was when I implemented a [forget design pattern name] system where libxml parsers are shared, returned, kept on a freelist, etc.

From what Artur and I can tell, this is better than most/all the other jabber servers out there.

It means with 1GB of RAM we can do 300k connections per process. (8GB of RAM boxes, 2x 2x core)
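
The freelist idea, sketched in Python rather than DJabberd's Perl: hand out an idle parser if one exists, build a new one only when the list is empty, and reset and return parsers when a connection no longer needs one. `ParserPool` and `FakeParser` are hypothetical names for illustration; `FakeParser` just stands in for an expensive libxml parser object.

```python
class FakeParser:
    """Stand-in for an expensive XML parser object (hypothetical)."""

    def __init__(self):
        self.buffer = []

    def feed(self, chunk):
        self.buffer.append(chunk)

    def reset(self):
        self.buffer.clear()


class ParserPool:
    """Keep idle parsers on a freelist instead of one per connection."""

    def __init__(self, factory, max_idle=1024):
        self._factory = factory
        self._free = []
        self._max_idle = max_idle
        self.created = 0

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.created += 1
        return self._factory()

    def release(self, parser):
        parser.reset()  # never leak one connection's state into the next
        if len(self._free) < self._max_idle:
            self._free.append(parser)


pool = ParserPool(FakeParser)
first = pool.acquire()
first.feed("<presence/>")
pool.release(first)
second = pool.acquire()  # reuses the same object, now empty
```

Since most connections sit idle between stanzas, only the ones actively parsing need to hold a parser at any moment.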

<3 epoll.

readahead / blocking sendfile [Jun. 5th, 2006|11:34 pm]

I've had a known inefficiency in Perlbal for ages now and finally broke down and fixed it. The inefficiency is that sendfile can block, even if the destination fd is a non-blocking socket, because the source fd (a disk-based file) can force a disk read if it's not already in pagecache.

FreeBSD has a fancy sendfile that lets you request it not block, but Linux doesn't.

The solution on Linux is to do a readahead() call first in another thread, or just sendfile() in another thread, either of which IO::AIO can do. I wanted to test the theory changing as little code as possible, so I went with the async readahead.

Before I did that, though, I wrote a test case.

The test case runs two processes in parallel: one fetching 3 small hot files over and over again, measuring the mean speed of 100 requests. The other process is there just to mess with the first one: it doesn't actually output anything. The second process either fetches the same 3 small files, or with the "big" parameter, fetches seven 100MB files in a loop, more than this Xen instance's 512 MB of memory. The idea is to see if the disk reads serving the big files stall the event loop and decrease turn-around time.

Yup:

lj@LJ_web:~$ ./parallel.pl small; ./parallel.pl small; ./parallel.pl small; ./parallel.pl big;./parallel.pl big;./parallel.pl big;
mean: 0.287987213134766, stddev: 0.0829109309255669
mean: 0.279777903556824, stddev: 0.0957734761804354
mean: 0.238886480331421, stddev: 0.0949280425469577

mean: 0.351436612606049, stddev: 0.0791952577383974
mean: 0.361295075416565, stddev: 0.0863025646086743
mean: 0.3904807305336, stddev: 0.173639453608837

The first set of three lines is the time to serve small files with other small files being served in the background. The second set of three is serving small files with big files being served in the background.

Adding in the async readahead() call and doing the sendfile in the callback (once the data to be sendfile'd is in the pagecache), the results even out a bunch:

lj@LJ_web:~$ ./parallel.pl small; ./parallel.pl small; ./parallel.pl small; ./parallel.pl big;./parallel.pl big;./parallel.pl big;
mean: 0.296060967445374, stddev: 0.0586433388625736
mean: 0.262518639564514, stddev: 0.0726501212827927
mean: 0.285000162124634, stddev: 0.0321991597111094

mean: 0.302280473709106, stddev: 0.0811435349061447
mean: 0.303003549575806, stddev: 0.0787071540895621
mean: 0.298841729164124, stddev: 0.0953137343692458

Probably some more work to be done, but promising.
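
The warm-the-cache-in-a-thread, then-sendfile idea described above, sketched with Python's asyncio instead of Perlbal and IO::AIO. As far as I know Python's os module has no readahead() wrapper, so this swaps in a plain read of the file in a worker thread to pull it into the pagecache; that is a stand-in, not the same call. `warm_pagecache`, `serve_file` and `demo` are made-up names, and the sketch assumes Linux.

```python
import asyncio
import os
import socket
import tempfile

CHUNK = 1 << 20  # read 1 MiB at a time while warming the cache


def warm_pagecache(path):
    """Blocking helper: read the file once so its pages land in the pagecache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.pread(fd, CHUNK, offset)
            if not data:
                return offset  # total bytes touched
            offset += len(data)
    finally:
        os.close(fd)


async def serve_file(sock, path):
    """Send a file on a non-blocking socket without stalling the event loop on disk."""
    loop = asyncio.get_running_loop()
    # The disk wait happens in a worker thread, not in the event loop.
    await loop.run_in_executor(None, warm_pagecache, path)
    with open(path, "rb") as f:
        # Uses os.sendfile under the hood where the platform supports it.
        return await loop.sock_sendfile(sock, f)


def demo():
    payload = b"x" * 4096
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    a, b = socket.socketpair()
    a.setblocking(False)
    try:
        sent = asyncio.run(serve_file(a, tmp.name))
        received = b.recv(len(payload), socket.MSG_WAITALL)
    finally:
        a.close()
        b.close()
        os.unlink(tmp.name)
    return sent, received == payload
```

Reading the whole file up front is wasteful for the 100MB case; a real server would warm and send one window at a time.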

lutimes(2) and Linux [Apr. 1st, 2006|04:18 pm]

This homeboy is sad that lutimes(2) isn't implemented in Linux.

I didn't even know of lutimes earlier today, because I didn't even know of utime()/utimes(). I was like, "How can I get brackup to restore modtimes? I know tar and rsync do it." So I straced tar, found utime/utimes, was happy, implemented, then found it didn't work on symlinks (or rather, it tried to follow symlinks). Told Whitaker about lstat (vs stat), then he googled lutimes, found it, told me, and I found that Linux doesn't implement it.

And that was after I straced tar on symlinks and found it didn't do anything, so I had little hope anyway.
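
For comparison, the follow-vs-don't-follow distinction looks like this in Python, where `os.utime` takes `follow_symlinks=False` on platforms that can do it. Later Linux kernels gained utimensat(2) with AT_SYMLINK_NOFOLLOW, so this sketch assumes a reasonably modern Linux and filesystem; `set_link_mtime` and `demo` are hypothetical names, not brackup code.

```python
import os
import tempfile


def set_link_mtime(path, mtime_seconds):
    """Set atime/mtime on the symlink itself rather than on its target."""
    if os.utime not in os.supports_follow_symlinks:
        raise NotImplementedError("platform cannot set times on a symlink")
    os.utime(path, (mtime_seconds, mtime_seconds), follow_symlinks=False)


def demo():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        with open(target, "w") as f:
            f.write("x")
        os.utime(target, (1000000000, 1000000000))
        os.symlink("target", link)
        set_link_mtime(link, 1143900000)
        # lstat looks at the link, stat follows it to the target.
        return int(os.lstat(link).st_mtime), int(os.stat(target).st_mtime)
```

The target's modtime is left alone, which is exactly what plain utime()/utimes() on the link path does not give you.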

But this breaks my brackup test suite which compares the output of "ls -lR" on backup dir and restored dir. So now I have to compare instead some new serialization of the before/after directories, ignoring symlink modtimes. Lame. I thought I was done.
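
One way to do that comparison, as a rough sketch: walk both trees with lstat and record what matters per entry, leaving modtimes out for symlinks. The choice of fields here is an assumption for illustration (directory modtimes are skipped too, since creating entries changes them), and `snapshot` and `demo` are made-up names, not what brackup's test suite does.

```python
import os
import stat
import tempfile


def snapshot(root):
    """Map relative path -> metadata tuple, ignoring symlink modtimes."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            rel = os.path.relpath(full, root)
            if stat.S_ISLNK(st.st_mode):
                # Only the link target matters; its mtime can't be restored.
                entries[rel] = ("link", os.readlink(full))
            elif stat.S_ISDIR(st.st_mode):
                entries[rel] = ("dir", stat.S_IMODE(st.st_mode))
            else:
                entries[rel] = ("file", stat.S_IMODE(st.st_mode),
                                st.st_size, int(st.st_mtime))
    return entries


def demo():
    """Build two trees whose only difference is when the symlink was made."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for root in (a, b):
            path = os.path.join(root, "file.txt")
            with open(path, "w") as f:
                f.write("data")
            os.utime(path, (1143900000, 1143900000))
            os.symlink("file.txt", os.path.join(root, "link"))
        return snapshot(a) == snapshot(b)
```

Comparing two dicts also gives a readable diff of which paths differ, which "ls -lR" output never did.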

Oh yeah, work on Brackup continues. It restores now. Coming soon to svn and CPAN near you.

splice() [Mar. 30th, 2006|10:38 am]
[Current Mood | excited]

Is anybody else excited about the in-development splice() system call?

I've wanted this for, like, ever.

I need to add support to Sys::Syscall so Perlbal can use it, avoiding copies to/from userspace to/from sockets.

wish before I leave; cdrecord; Jörg [Feb. 17th, 2006|08:24 pm]

When I come back from Belize, I hope somebody has kicked Jörg Schilling in the metaphoric nuts by forking cdrecord so I don't have to see his name on the lkml anymore.

I guess Debian's already done that, a bit, but I want it done more.

OpenSolaris on Xen [Feb. 14th, 2006|12:07 pm]

OpenSolaris just gained hardware support:

http://blogs.sun.com/roller/page/tpm?entry=opening_day_for_opensolaris_on

(by running OpenSolaris as a domU with a Linux dom0, OpenSolaris can use any NIC, any block device, etc....)

linux-kernel; inlines [Jan. 2nd, 2006|09:37 pm]

I'm addicted to reading the linux-kernel list. There's a big thread going on about changing the meaning of "inline" in the kernel tree to mean "if gcc4 wants to" instead of the historical "always-inline!" that was required due to gcc3 quirks. Then introducing a new "__always_inline" to actually mean __attribute__((always_inline)), for the few places in the kernel that require inline.

I guess the whole argument is that inline has turned into a "ricing option" that programmers throw about for tons of bogus reasons, not understanding gcc, not understanding other architectures, etc. Hence the patches to remove them all and just let the compiler do it, because it can't get any worse.

I liked this post from Ingo:
....
furthermore, there's also a new CPU-architecture argument: the cost of
icache misses has gone up disproportionally over the past couple of
years, because on the first hand lots of instruction-scheduling
'metadata' got embedded into the L1 cache (like what used to be the BTB
cache), and secondly because the (physical) latency gap between L1 cache
and L2 cache has increased. Thirdly, CPUs are much better at untangling
data dependencies, hence more compact but also more complex code can
still perform well. So the L1 icache is more important than it used to
be, and small code size is more important than raw cycle count - _and_
small code has less of a speed hit than it used to have. x86 CPUs have
become simple JIT compilers, and code size reductions tend to become the
best way to inform the CPU of what operations we want to compute.
...

Xen merge [Nov. 1st, 2005|10:48 am]

Article about Xen merge into the kernel:
http://www.eweek.com/article2/0,1895,1879667,00.asp

Xen [Oct. 9th, 2005|04:30 pm]

My home server is now running atop Xen. I love it so.
