April 8th, 2004

belize

linux source

Tried to sleep a couple times but now for some reason I'm trying to read the source to Linux. It's surprisingly well documented.

In particular, I'm trying to figure out what's responsible for freeing slab items of type "nfs_inode_cache" and "dentry_cache". Earlier today Lisa gave me the mystery of figuring out why two identical machines (hardware, software, roles, weighting) showed massively different amounts of free memory (120 MB vs 500 MB). I looked at /proc/slabinfo and saw nfs_inode_cache and dentry_cache incredibly high on one of them, which I immediately recognized as find(1)'s fault; that was the only machine of ours where it wasn't disabled.

So while I had confidence those cache items in the slabs would be reclaimed when necessary, I wanted to see the magic.

I've been looking at the slab, NFS, fs, and now VM code. Think I found what I was looking for:

mm/vmscan.c:try_to_free_pages -> shrink_slab

It's pretty cool... there's a "shrinker_list" of caches that can be shrunk with callbacks for the VM.

(easily amused at 2am)

So from there:

fs/dcache.c: set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
fs/inode.c: set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);

Now I'm curious if there's a way to invoke try_to_free_pages by hand, without waiting for the system memory to get low. I recognize it's not very useful, but I'm just curious.

I should sleep though or I won't get to work on time and be able to wear my employee of the month shirt for the ~5th month in a row come May 1st.

belize

memcached and linux

I found this thread interesting:

Large slab cache in 2.6.1
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/index.html

My understanding at this point:

Linux has an LRU for cache items, but the cache items are spread across pages all over the slab.

When the VM needs memory, it pressures the caches into shrinking themselves to give up pages. But since they're just freeing from their LRUs, very often they don't actually give up any whole pages; they just free slab items scattered across a bunch of misc pages.

Experiments were apparently done to LRU the slab pages themselves rather than the items, but since there was apparently no way to actually free all the items on a slab page (why? locking?), it effectively only added reclaim pressure from the vfs code, so they just did that instead. (or something)

This is the best analysis of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1908.html
> That was Ed. Because we cannot reclaim
> slab pages direct from the LRU

> I think that this is needed: Bonwick's slab algorithm (i.e. two-level linked lists, implemented in cache_alloc_refill and free_block) is intended for unfreeable objects.
> The dentry cache is a cache of freeable objects - a different algorithm would be more efficient for shrinking the dentry cache after an updatedb.
> I had started prototyping, but didn't get far.
> --
> Manfred

So I think this is all pretty lame, but memcached has exactly the same problem. (trying to use a slab allocator as a cache, and freeing objects, not pages...)

So this has me thinking about memcached improvements:

-- instead of 1MB slab pages, reduce that to, say, 128k

-- higher granularity size classes (not just powers of two). this'll solve our problem of only utilizing ~65% of memory.

-- global per-slab LRU, not one LRU for items per-size-class. this'll solve our problem of having to run the slab-page-balancer script against memcached when usage patterns change over time. (currently memcached never re-classes pages to different sized objects on demand)

-- fragment items between slabs, so a 1MB object can go over, say, 9 pages. (would be 8, but struct overhead). this is nowhere near as bad as the TLB thrashing that would've been involved with our old idea (never implemented due to fear) of just using a bunch of 64-byte chunks. (the idea that "memory access is free" inspired that idea, but that's not totally true...)

-- when a new item needs to be added, free slabs, not items. will require iterating over the slab page to find the items, but we already do that at one point (during our ghetto manual page reassignment code) and thus are already guaranteeing empty parts of slabs will be zeroed out. when freeing a middle fragment of an item, the item free code will need to jump to the top, then walk the fragments, zeroing out each part on each relevant slab page.

-- if we encounter a slab that can't be freed because one of its items has a refcount > 0 (transfer in progress), move to the next slab up the LRU and try again, up to ~20 tries. (just like we do now with the item LRU)

avva/others... thoughts?