Relocating memory allocator - brad's life — LiveJournal
Brad Fitzpatrick

[ website | bradfitz.com ]

Relocating memory allocator [May. 26th, 2003|01:53 pm]
Brad Fitzpatrick
So, it turns out memcached takes up no CPU and makes the site blazing fast (even though we're only caching a few object types so far!).

Unfortunately, it did what we suspected: died hard when its address space got fragmented. (In the meantime I restarted the daemons with much smaller sizes, accounting for the bloat which accumulates over a few hours....)

I jotted out notes for an arena allocation scheme where we only malloc large, fixed-size pages and handle our own placement and recompaction within those pages. However, I can't believe this hasn't been done before.

I want a memory allocator that lets you register a hook when it wants to move allocations around to defragment, so you can update pointers and whatnot. Anybody know of one?
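For illustration, a minimal sketch of the kind of interface I mean: a fixed-size arena with a bump allocator, a compactor that slides live objects together, and a hook that reports every move so the application can fix up its pointers. All of the names here (arena_alloc, arena_compact, and so on) are invented; this is not an existing allocator.

```c
/* Hypothetical sketch of an arena allocator with a relocation hook.
 * All names are invented for illustration; not an existing library. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 4096

typedef void (*relocate_fn)(void *old_addr, void *new_addr, void *ctx);

struct hdr { size_t size; int live; };     /* header before each object */

struct arena {
    char buf[ARENA_SIZE];
    size_t used;
    relocate_fn hook;    /* called for every object the compactor moves */
    void *hook_ctx;
};

/* Bump-allocate from the page.  A real allocator would round size up
 * for alignment; 16-byte sizes happen to work out here. */
static void *arena_alloc(struct arena *a, size_t size)
{
    size_t need = sizeof(struct hdr) + size;
    if (a->used + need > ARENA_SIZE)
        return NULL;                  /* caller should compact and retry */
    struct hdr *h = (struct hdr *)(a->buf + a->used);
    h->size = size;
    h->live = 1;
    a->used += need;
    return h + 1;
}

static void arena_free(void *p)
{
    ((struct hdr *)p - 1)->live = 0;  /* space reclaimed at next compact */
}

/* Slide live objects toward the start of the page, reporting every
 * move through the hook so the application can fix up its pointers. */
static void arena_compact(struct arena *a)
{
    size_t src = 0, dst = 0;
    while (src < a->used) {
        struct hdr *h = (struct hdr *)(a->buf + src);
        size_t span = sizeof(struct hdr) + h->size;
        if (h->live) {
            if (dst != src) {
                memmove(a->buf + dst, a->buf + src, span);
                if (a->hook)
                    a->hook(a->buf + src + sizeof(struct hdr),
                            a->buf + dst + sizeof(struct hdr), a->hook_ctx);
            }
            dst += span;
        }
        src += span;
    }
    a->used = dst;
}

/* Demo hook: record the most recent move so a caller can re-point. */
static void *last_old, *last_new;
static void record_move(void *o, void *n, void *ctx)
{
    (void)ctx;
    last_old = o;
    last_new = n;
}
```

The hard part, of course, is what the hook does: the application has to know every place a pointer into the arena is stored.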

[User Picture]From: taral
2003-05-27 09:43 pm (UTC)
That doesn't work. You can easily create a priority inversion that way. That's why I have the kernel throwing away the LRU pages in the transient mapped region and then telling you about it after the fact.
From: jeffr
2003-05-28 12:46 am (UTC)
Priority inversion between a low priority thread with a lot of memory and a high priority thread that wants memory?

Perhaps I didn't explain the interface in much detail.

Most VMs have a system of page free targets. The pageout daemons try to keep the page lists at these targets. Very few actually free memory right when they need it. That ends up leading to all sorts of issues other than priority inversion.

In any case, the signal is not a last resort. It's used when memory pressure is high but not quite fatal. You nicely ask processes to give up some memory. If that fails, you kill them and take it.

This application is a prime example of where that might be helpful. If you have a siginfo-capable operating system you could even deliver information about how many pages to free, etc.

What happens if the application is following a pointer and you free it? How do you handle the segfault when you've thrown the data away? Does the app have a segfault handler that backs off on the cache? This seems problematic.

I think the SIGDANGER (AIX terminology) technique is more sound. You should look it up.
[User Picture]From: taral
2003-05-28 03:09 pm (UTC)
I work on AIX. Let me tell you now, SIGDANGER is a hack. Nothing more. There are plenty of cases where a process never gets to handle SIGDANGER because the SIGKILL has already been posted to it.

For a daemon like memcached, we'd probably have to start paging before memcached got a chance to run -- that's bad. I prefer the kernel dumping pages in LRU order as a solution, since it lets the kernel drop pages when it needs them without having to wait for a user-mode process to give them up.

It's also preferable because the high-water-mark-based system will sometimes cause the system to dump memory unnecessarily.

What's wrong with this in your opinion?
From: jeffr
2003-05-28 05:34 pm (UTC)
Watermark systems are just fine when they are dynamic. I am often opposed to fixed watermarks or targets where a reasonable job could be done of estimating the real requirements based on past behavior. As long as your algorithm is adaptive you will eventually reach a steady state that accurately reflects the requirements of the system. I implemented a system of dynamic targets based on user requirements and memory pressure for tuning per-CPU cache sizes in the FreeBSD kernel memory allocator. It ended up being extremely effective at adjusting to varying system loads.

With regard to the signal delivery... in most kernels there is already a mechanism to temporarily raise the priority of a process so that it may receive a critical signal; otherwise you wouldn't be able to kill something that had a very low priority. If the application is catching the DANGER signal, you could raise its priority for a few time slices so that it has a chance to clean up some resources.

I can't speak for the AIX implementation, but I do not feel that this concept is a hack. Many userland malloc implementations are capable of returning memory to the system now. With a system like SIGDANGER they could tune their cache sizes according to system memory pressure, which is exactly the problem we're trying to address here.

If you simply remove the pages, how do you handle user faults for those addresses? I can imagine a system where the user catches SIGSEGV and has a jmp_buf to return to on fault, but that sounds quite complicated and error-prone. How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone? Keep in mind that the user could be executing any arbitrary code at the time that you decide to discard pages.
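The SIGSEGV-and-jump-buffer recovery in question can at least be sketched. This is a hypothetical illustration of the technique, not code from any of the systems discussed -- and the delicacy of longjmp'ing out of a SIGSEGV handler is part of why it's error-prone:

```c
/* Sketch of SIGSEGV-plus-jmp_buf recovery: attempt the read, and if
 * the page is gone, jump back to a recovery point instead of
 * crashing.  Hypothetical illustration only. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf fault_recovery;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_recovery, 1);    /* back to the access site */
}

/* Try to read *p into *out; return -1 if the access faults. */
static int guarded_read(volatile char *p, char *out)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    volatile int ok = -1;             /* volatile: survives the longjmp */
    if (sigsetjmp(fault_recovery, 1) == 0) {
        *out = *p;                    /* may fault */
        ok = 0;
    }
    sigaction(SIGSEGV, &old, NULL);   /* restore the old handler */
    return ok;
}
```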
[User Picture]From: taral
2003-05-28 05:56 pm (UTC)
How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone?

Since the signal interrupts any currently in-progress operations, the application should not access the pages before receiving the update. Also, any pages that are "in-use" should be marked thus by the use of mlock() or madvise() to prevent their removal.
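A small demonstration of the reclaim half of this scheme. On Linux, madvise(MADV_DONTNEED) on an anonymous mapping makes the kernel drop the pages immediately and reads return zero-fill afterwards, which stands in here for LRU reclaim; pinning in the real scheme would be mlock() on the pages in active use. The helper names are invented:

```c
/* Demonstration of reclaim versus pinning.  madvise(MADV_DONTNEED)
 * stands in for the kernel dropping an unpinned LRU page; a dropped
 * anonymous page reads back as zeros on Linux.  Pages in active use
 * would be pinned with mlock() and so never reclaimed.  Helper
 * names are hypothetical. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pin pages the application is actively touching. */
static int pin_pages(void *addr, size_t len)   { return mlock(addr, len); }
static int unpin_pages(void *addr, size_t len) { return munlock(addr, len); }

/* What the kernel's reclaim would do to an unpinned page. */
static int drop_pages(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}
```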
From: jeffr
2003-05-28 06:05 pm (UTC)
That adds two system calls for every access to a cached object. I guess the application could be written with a two-tiered cache: the first tier would always be locked, and objects in the second would have to be migrated to the first and locked before use.

Even so, this seems like a very imposing interface. You have to page-align all objects, and every object must be page-sized; otherwise you pin down unrelated objects.

It still seems to me that the signal is more flexible and less intrusive. It's also less code.

It's your deal though, have fun with it.
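The two-tiered idea could be sketched roughly like this (entirely hypothetical; the tier moves would really be mlock()/munlock() calls on page-aligned objects):

```c
/* Rough sketch of a two-tiered cache: tier 1 is always pinned;
 * tier-2 objects must be promoted (and pinned) before use because
 * the kernel may discard them at any time.  Hypothetical. */
#include <assert.h>

enum tier { TIER_PINNED, TIER_RECLAIMABLE };

struct tiered_obj {
    enum tier tier;
    int valid;          /* tier-2 data may already have been dropped */
};

/* Promote an object into the pinned tier before touching it.
 * Returns -1 if the reclaimable copy is gone (refetch needed). */
static int promote(struct tiered_obj *o)
{
    if (o->tier == TIER_PINNED)
        return 0;
    if (!o->valid)
        return -1;
    o->tier = TIER_PINNED;       /* real code: mlock() the pages */
    return 0;
}

static void demote(struct tiered_obj *o)
{
    o->tier = TIER_RECLAIMABLE;  /* real code: munlock() the pages */
}
```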
[User Picture]From: taral
2003-05-28 10:07 pm (UTC)
Oh, sure. The signal is more flexible and less intrusive. It's also less effective. I'll do a few experiments on my end and tell you what I end up with.
From: jeffr
2003-05-29 11:16 am (UTC)

I do like to beat a dead horse.

By effectiveness I assume you mean the amount of memory reclaimed.

In the signal based solution the memory reclaimed is limited by the application because it has to voluntarily give it up.

In the other solution the memory reclaimed is limited by the application because it has to mark it as reclaimable and unlock it.

The second solution also seems more costly in the kernel: you have to dig through your VM data structures to find something suitable to remove and then make sure you can remove it. You also end up maintaining the notion of freeable pages in two places instead of one.

This also restricts you to a coarse-grained LRU page replacement policy. If the page reclaiming is done in the application, it can exploit application-specific knowledge to build a free list that is smarter than LRU. It is coarse-grained because most VMs don't reset the accessed bits on pages with any great frequency, while in the application perfect LRU order could be maintained by simply using a tail queue.

Anyway, I'm sure both solutions will work well enough. I've already bikeshedded this discussion enough; I'll try to find some other outlet for my post-coffee programming discussion frenzy.
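The tail-queue LRU mentioned above is simple enough to sketch. This is a generic intrusive list, not memcached code: touching an entry moves it to the head, so the tail is always the least-recently-used victim, all in O(1) with no access-bit scanning.

```c
/* Application-side LRU as a tail queue.  Entries embed their own
 * links ("intrusive" list); touching an entry moves it to the head,
 * so the tail is always the least-recently-used victim. */
#include <assert.h>
#include <stddef.h>

struct lru_entry {
    struct lru_entry *prev, *next;
};

struct lru_list {
    struct lru_entry *head, *tail;
};

static void lru_insert_head(struct lru_list *l, struct lru_entry *e)
{
    e->prev = NULL;
    e->next = l->head;
    if (l->head) l->head->prev = e; else l->tail = e;
    l->head = e;
}

/* Unlink an entry that is currently on the list. */
static void lru_remove(struct lru_list *l, struct lru_entry *e)
{
    if (e->prev) e->prev->next = e->next; else l->head = e->next;
    if (e->next) e->next->prev = e->prev; else l->tail = e->prev;
    e->prev = e->next = NULL;
}

/* On every cache hit, move the entry back to the head. */
static void lru_touch(struct lru_list *l, struct lru_entry *e)
{
    lru_remove(l, e);
    lru_insert_head(l, e);
}

/* The reclaim candidate is always the tail. */
static struct lru_entry *lru_victim(struct lru_list *l)
{
    return l->tail;
}
```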
[User Picture]From: taral
2003-05-29 03:14 pm (UTC)

So do I.

Actually, effectiveness is "probability of avoiding swapping out data".

Yes, bikeshed indeed...
[User Picture]From: taral
2003-05-28 06:03 pm (UTC)
Grr. It ate my comment.

How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone?

By a process of pinning. The user uses mlock() or madvise() to pin the pages it needs to work on. When it's done, it unpins them. Pinned pages are not considered for reclaim by this system.