new memcached memory layout [Oct. 18th, 2003|07:21 pm]
Brad Fitzpatrick
Current memcached memory management:
http://lists.danga.com/pipermail/memcached/2003-September/000214.html

New proposal:
http://lists.danga.com/pipermail/memcached/2003-October/000302.html

If you've read doc/memory_management.txt in the memcached release, you'll
know that currently we keep a dozen or so different LRUs, one for each
slab class size (64 bytes, 128 bytes, .... 256k, 512k, 1MB).

And then we have slabs of each size.

Once a 1MB page is allocated to a size class and cut up, it's never used
for any other size.
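
In other words, each class keeps its own everything, roughly like this (just a sketch, not the actual memcached source):

    /* Rough sketch of the current scheme: one set of slabs and one LRU per
     * power-of-two size class, from 64 bytes up to 1MB.  A 1MB page, once
     * cut into pieces of a given class size, stays in that class forever. */
    struct slabclass {
        unsigned int  size;       /* 64, 128, ..., 256k, 512k, 1MB */
        void        **free_list;  /* free pieces of exactly this size */
        void        **pages;      /* 1MB pages dedicated to this class */
        struct item  *lru_head;   /* one LRU per class */
        struct item  *lru_tail;
    };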

The problem is: if the application changes over time, or if you enable
compression, it's likely one of your size classes will end up
disproportionately sized, with its objects living too long or too short
compared to objects in other classes.

Ideally you'd have one LRU.

Slabs sized in powers of two (2^n) are also not ideal for memory
efficiency. Empirical evidence (41 GB of LiveJournal memcached data) shows
that 30-35% of the memory is wasted. (If you want to store a 129 byte
object, it has to go in the 256 byte class.)

I propose all objects are stored as a linked list of 64 byte chunks.

So we have one LRU, and one slab size class.
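
Concretely, something like this (a minimal sketch in C; the names are made
up, not the real memcached structs):

    #define CHUNK_SIZE 64

    /* Every allocation is one fixed 64-byte chunk; an object is a singly
     * linked list of them, all pulled from a single global free list. */
    typedef struct chunk {
        struct chunk *next;                      /* next chunk of this item, or NULL */
        char data[CHUNK_SIZE - sizeof(void *)];  /* 60 usable bytes (56 on 64-bit) */
    } chunk;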

When an object rolls off the LRU, or when it's deleted or expired, all its
chunks are freed.
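
Freeing an object is then just walking its list (again a sketch;
free_chunk() here is a stand-in for pushing a chunk back onto the global
free list):

    static void item_free_chunks(chunk *c) {
        while (c != NULL) {
            chunk *next = c->next;   /* save the link before recycling this chunk */
            free_chunk(c);           /* back onto the one global free list */
            c = next;
        }
    }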

The top efficiency for this is 60/64, or 93.7%, since 4 of each chunk's 64
bytes go to the next-chunk pointer. (Or for 64 bit computers, where the
pointer takes 8 bytes, 56/64 = 87.5%.)

The whole reason we did a slab allocator was to avoid fragmentation, which
makes it hard to allocate a contiguous region after running for hours.
Early versions of memcached would hang after hours, with glibc's malloc
totally confused.

But fragmentation isn't important here. It is for disks, where seek times
are terrible, but in memory we can put things wherever we want and access
is always fast.

Pros: much better memory use, one LRU (so no slab class reassignment logic)

Cons:

The main complication is long keys, where the internal structure for the
item is longer than 64 bytes. But that's already at the end, and we
already have macros to get to it. The important stuff in the struct
(refcounts, etc) will always be in the first chunk, so we can continue
casting that first chunk to the item* struct.
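
Roughly (illustrative fields only, not the actual item struct):

    typedef struct item {
        struct chunk  *next_chunk;   /* remaining chunks, if key+data don't fit in 64 bytes */
        unsigned short refcount;     /* the important stuff stays in the first chunk */
        unsigned short flags;
        unsigned int   exptime;      /* expiration time */
        unsigned int   nbytes;       /* total data length across all chunks */
        unsigned char  nkey;         /* key length */
        char           end[];        /* key (then data) starts here; macros reach past it */
    } item;

    /* The first chunk of every object is always an item header, so code
     * can keep doing:  item *it = (item *) first_chunk;  */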

More CPU, but this thing already takes no CPU, so I'm not concerned about
that at all.

- Brad

Comments:
From: whitaker
2003-10-19 12:28 am (UTC)
Heh, after reading the proposed solution, it seems like such an obviously good way to do it... Funny how that always seems to work out.
From: evan
2003-10-19 11:31 am (UTC)
i was just thinking about the memcached memory stuff last night, because i was reading that swarm book (i think the idea is it's a new approach for solving optimization problems-- i know toast did some toy apps with "ants" and i think he may have used it in some chemistry research, too).

it seems that you could tune the slab sizes for a given dataset and that would give you the best of both worlds. it's "just" another optimization problem, no? i actually even cvs updated memcached to begin looking at how hard it would be to change the slab sizes based on some command-line parameters.

whatever happened to your analysis of that? did it ever go anywhere? the main drawback i saw was that weird-sized pieces might not fit together nicely in 1mb chunks or whatever. and, of course, that you'd have to reboot your memcaches whenever you tuned the parameters (though maybe you could do something really tricky at runtime, i'm not sure it'd be worth it).

as for the linked list: interesting idea. you definitely are sacrificing things you don't need (cpu) for things you do (memory efficiency). aside from cpu, you also lose all of your caches, but memory bandwidth probably won't be an issue until network bandwidth increases. making the chunk size tunable would also have the effect of one fixed-size slab-- if i know i'm always gonna have 1mb chunks, i could just tell memcached to make the list elements that big (well, with an adjustment for bookkeeping data).
From: brad
2003-10-19 11:40 am (UTC)
I did the whole slab size optimization thing, and it was slow. And our application changed too often. Restarting memcached isn't an option. Plus, my goal for memcached is that administering it is dead simple: you shouldn't need to know about "slab caches". You should just start it and ignore it forever.

I don't think L1 caches are an issue here, either. Why do you think that?
From: evan
2003-10-19 12:15 pm (UTC)
my memory of caches is kinda hazy. i was under the impression cache lines were larger, on the order of like 1k, but i guess they're only 64 bytes anyway.

the reason i was worrying is that when you're pulling a k of data out of your linked-list scheme, you're (potentially) jumping all over memory, while sequential reads could hit your cache. but if you're using 64 byte chunks anyway, i guess my worry wasn't useful. (and in any case, memory bandwidth isn't your bottleneck...)
From: toast0
2003-10-19 02:11 pm (UTC)
I had the same general impression as evan about cache line sizes. The reason I brought it up (on the list) was that for reads that require access to ram, the cpu has to sit around waiting for the ram; with sequential reads, pretty much only the first read is gonna have to wait, since the cache is going to grab everything else before it's requested. This probably won't end up being a big performance hit, but it's likely to be noticeable. The significant increase in memory allocation efficiency is probably more important than the marginal loss in performance.
