Large slab cache in 2.6.1
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/index.html
My understanding at this point:
Linux keeps an LRU of cache items, but those items are spread across pages all over the slab caches.
When the VM needs memory, it pressures the caches into shrinking themselves to give up pages. But since they're just freeing items off their LRU, they very often don't give up any whole pages at all; they just free slab items scattered across a bunch of miscellaneous pages.
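A toy simulation makes the failure obvious (all the numbers here are made up for illustration: 4k pages, 32 items per page, and LRU age uncorrelated with page placement). Free the coldest 10% of items and count how many pages actually come back whole; the answer is essentially zero, because emptying a page requires all 32 of its items to land in the coldest 10%:

    #include <stdio.h>
    #include <stdlib.h>

    #define PAGES    1024
    #define PER_PAGE 32                    /* e.g. 4k page / 128-byte item */
    #define OBJS     (PAGES * PER_PAGE)

    int main(void)
    {
        static int page_of[OBJS];          /* slab page of each LRU slot */
        static int live[PAGES];            /* live items left per page */
        int i, freed_pages = 0, to_free = OBJS / 10;

        for (i = 0; i < OBJS; i++)         /* PER_PAGE items on each page */
            page_of[i] = i / PER_PAGE;
        srand(42);
        for (i = OBJS - 1; i > 0; i--) {   /* shuffle: LRU age is */
            int j = rand() % (i + 1);      /* uncorrelated with page */
            int t = page_of[i];
            page_of[i] = page_of[j];
            page_of[j] = t;
        }
        for (i = 0; i < PAGES; i++)
            live[i] = PER_PAGE;

        for (i = 0; i < to_free; i++)      /* free the coldest 10% */
            if (--live[page_of[i]] == 0)
                freed_pages++;

        printf("freed %d items, emptied %d of %d pages\n",
               to_free, freed_pages, PAGES);
        return 0;
    }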
Experiments were apparently done to LRU the slab pages themselves rather than the items, but since there was apparently no way to actually free all the items on a slab page (why? locking?), it effectively just added reclaim pressure from the VFS code, so they did that instead. (or something)
This is the best analysis of the problem:
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1908.html
> That was Ed. Because we cannot reclaim
> slab pages direct from the LRU
> I think that this is needed: Bonwick's slab algorithm (i.e. two-level
> linked lists, implemented in cache_alloc_refill and free_block) is
> intended for unfreeable objects.
> The dentry cache is a cache of freeable objects - a different algorithm
> would be more efficient for shrinking the dentry cache after an updatedb.
> I had started prototyping, but didn't get far.
> --
> Manfred

So I think this is all pretty lame, but memcached has exactly the same problem. (trying to use a slab allocator as a cache, and freeing objects, not pages...)
So this has me thinking about memcached improvements:
-- instead of 1MB slab pages, reduce that to, say, 128k
-- higher-granularity size classes (not just powers of two). this'll solve our problem of only utilizing ~65% of memory. (see the size-class sketch after this list)
-- a global per-slab-page LRU, instead of one item LRU per size class. this'll solve our problem of having to run the slab-page-balancer script against memcached when usage patterns change over time. (currently memcached never re-classes pages to different-sized objects on demand)
-- fragment items across slab pages, so a 1MB object can span, say, 9 pages. (would be 8, but struct overhead.) this is nowhere near as bad as the TLB thrashing that would've been involved in our old idea (never implemented, out of fear) of just using a bunch of 64-byte chunks. (the idea that "memory access is free" inspired that one, but that's not totally true...) (layout sketched after this list)
-- when a new item needs to be added, free slabs, not items. this will require iterating over the slab page to find the items, but we already do that in one place (our ghetto manual page-reassignment code), so we're already guaranteeing that empty parts of slabs are zeroed out. when freeing a middle fragment of an item, the item-free code will need to jump to the head, then walk the fragments, zeroing out each part on each relevant slab page.
-- if we encounter a slab page that can't be freed because one of its items has a refcount > 0 (transfer in progress), move to the next slab up the LRU and try again, up to ~20 tries. (just like we do now with the item LRU.) (eviction sketch after this list)
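Here's roughly what I mean by finer size classes. The 1.25 growth factor and the 48-byte minimum are numbers I'm pulling out of the air, but the point stands: worst-case internal waste drops from ~50% (pure powers of two) to under ~20%:

    #include <stdio.h>
    #include <stddef.h>

    #define SLAB_PAGE_SIZE (128 * 1024)  /* the proposed 128k pages */
    #define MIN_ITEM_SIZE  48            /* arbitrary starting class */
    #define GROWTH_FACTOR  1.25          /* arbitrary; trades waste vs. class count */

    int main(void)
    {
        size_t sz = MIN_ITEM_SIZE;
        int class = 0;

        while (sz <= SLAB_PAGE_SIZE) {
            sz = (sz + 7) & ~(size_t)7;  /* keep items 8-byte aligned */
            printf("class %2d: %7zu bytes, %5zu items/page\n",
                   class++, sz, (size_t)SLAB_PAGE_SIZE / sz);
            sz = (size_t)(sz * GROWTH_FACTOR);
        }
        return 0;
    }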
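And a rough sketch of the fragmented-item layout. All the struct and field names here are made up, not memcached's real ones; the key part is that freeing always starts from the head and walks the fragment chain, zeroing each fragment's bytes on its own slab page:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct frag {
        struct frag *next;           /* next fragment, NULL on the last */
        uint32_t     nbytes;         /* payload bytes in this fragment */
        /* payload follows the header on the same slab page */
    };

    struct item_head {
        struct frag *frags;          /* first fragment of the value */
        uint32_t     total_bytes;    /* value size across all fragments */
        uint16_t     refcount;       /* > 0 while a transfer is in flight */
    };

    /* Free from the head: walk the chain so each fragment gets zeroed
     * on its own slab page, preserving the "empty parts of slabs are
     * zeroed" invariant that the slab-walk code depends on. */
    void item_free(struct item_head *it)
    {
        struct frag *f = it->frags;

        while (f) {
            struct frag *next = f->next;
            memset(f, 0, sizeof(*f) + f->nbytes);  /* zero this page's chunk */
            f = next;
        }
        memset(it, 0, sizeof(*it));
        /* real code would also return these chunks to each page's free map */
    }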
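Finally, the eviction path, assuming a global doubly-linked LRU of slab pages and a per-page count of refcounted items. The helper functions (and everything else named here) are hypothetical:

    #include <stddef.h>

    struct slab_page {
        struct slab_page *lru_prev;  /* toward hotter pages */
        struct slab_page *lru_next;  /* toward colder pages */
        unsigned          pinned;    /* items on this page with refcount > 0 */
    };

    /* hypothetical helpers: walk a page zeroing every live item,
     * then hand the emptied page to whatever size class needs it */
    void slab_page_free_items(struct slab_page *pg);
    void slab_page_assign_class(struct slab_page *pg, int size_class);

    #define EVICT_TRIES 20           /* same cap we use on the item LRU today */

    /* Reclaim one whole page off the tail of the global page LRU,
     * skipping pages pinned by in-flight transfers.  Returns NULL if
     * the ~20 coldest pages were all pinned. */
    struct slab_page *evict_one_page(struct slab_page *lru_tail, int size_class)
    {
        struct slab_page *pg = lru_tail;
        int tries;

        for (tries = 0; pg && tries < EVICT_TRIES; tries++, pg = pg->lru_prev) {
            if (pg->pinned)          /* something mid-transfer; skip it */
                continue;
            slab_page_free_items(pg);   /* free *every* item on the page */
            slab_page_assign_class(pg, size_class);
            return pg;
        }
        return NULL;                 /* caller can fail the store, etc. */
    }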