?

Log in

No account? Create an account
memcached and linux - brad's life — LiveJournal [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

memcached and linux [Apr. 8th, 2004|11:56 am]
Brad Fitzpatrick
I found this thread interesting:

Large slab cache in 2.6.1
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/index.html

My understanding at this point:

Linux has an LRU for cache items, but the cache items are spread across pages all over the slab.

When the VM needs memory, it pressures the caches into shrinking themselves to give up pages. But since they're just freeing off their LRU, they incredibly often don't even give up pages, they just free up slab items across a bunch of misc pages.

Experiments were apparently done to actually LRU the slab pages themselves, and not the items, but since they had no way apparently to actually free all items on the slab page (why? locking?) it only effectively added reclaim pressure from the vfs code, so they just did that instead. (or something)

This is the best analysis of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1908.html
> That was Ed. Because we cannot reclaim
> slab pages direct from the LRU

I think that this is needed: Bonwick's slab algorithm (i.e. two-level linked lists, implemented in cache_alloc_refill and free_block) is intended for unfreeable objects.
The dentry cache is a cache of freeable objects - a different algorithm would be more efficient for shrinking the dentry cache after an updatedb.
I had started prototyping, but didn't get far.
--
Manfred
So I think this is all pretty lame, but memcached has exactly the same problem. (trying to use a slab allocator as a cache, and freeing objects, not pages...)

So this has me thinking about memcached improvements:

-- instead of 1MB slab pages, reduce that to, say, 128k

-- higher granularity size classes (not just powers of two). this'll solve our problem of only utilizing ~65% of memory.

-- global per-slab LRU, not one LRU for items per-size-class. this'll solve our problem of having to run the slab-page-balancer script against memcached when usage patterns change over time. (currently memcached never re-classes pages to different sized objects on demand)

-- fragment items between slabs, so a 1MB object can go over, say, 9 pages. (would be 8, but struct overhead). this is nowhere as bad as the TLB thrashing that would've been involved with our old idea (never implemented due to fear) of just using a bunch of 64-byte chunks. (the idea that "memory access is free" inspired that idea, but that's not totally true...)

-- when a new item needs to be added, free slabs, not items. will require iterating over the slab page to find the items, but we already do that at one point (during our ghetto manual page reassignment code) and thus are already guaranteeing empty parts of slabs will be zeroed out. when freeing a middle fragment of an item, the item free code will need to jump to the top, then walk the fragments, zeroing out each part on each relevant slab page.

-- if we encounter a slab that can't be free due to an item that can't be freed due to a refcount > 0 (transfer in progress), move to the next slab up the LRU and try again, up to ~20 tries. (just like we do now with the item LRU)

avva/others... thoughts?
LinkReply

Comments:
[User Picture]From: way2tired
2004-04-08 12:09 pm (UTC)
Only downside I see is you'll have 8x as many slabs. I'm not sure I understand the rest well enough to see other problems, though it also makes me think about problems with objects that use more than 128k. (maybe there are none?) I wouldn't think you'd want objects spanning slabs if at all possible.
(Reply) (Thread)
[User Picture]From: brad
2004-04-08 12:11 pm (UTC)
I don't see how having more slabs would hurt anything, though.
(Reply) (Parent) (Thread)
[User Picture]From: way2tired
2004-04-08 12:50 pm (UTC)
In trying to free slabs, wouldn't 8x as many of them slow down the iteration?
(Reply) (Parent) (Thread)
[User Picture]From: brad
2004-04-08 01:32 pm (UTC)
We never iterate over the slab list.

When I said iterate I meant over the slab itself to free the items.
(Reply) (Parent) (Thread)