Relocating memory allocator - brad's life — LiveJournal
Relocating memory allocator [May. 26th, 2003|01:53 pm]
Brad Fitzpatrick
So, it turns out that memcached takes up no CPU and makes the site blazing fast (even though we're only caching a few object types so far!).

Unfortunately, it did what we suspected: died hard when its address space got fragmented. (In the meantime I restarted the daemons with much smaller sizes, accounting for the bloat that accumulates over a few hours....)

I jotted down notes for an arena allocation scheme where we only malloc large, fixed-size pages and handle our own placement and recompaction within those pages. But I can't believe this hasn't been done before.

I want a memory allocator that lets you register a hook when it wants to move allocations around to defragment, so you can update pointers and whatnot. Anybody know of one?

Comments:
From: jeffr
2003-05-28 05:34 pm (UTC)
Watermark systems are just fine when they are dynamic. I am often opposed to fixed watermarks or targets where a reasonable job could be done of estimating the real requirements based on past behavior. As long as your algorithm is adaptive, you will eventually reach a steady state that accurately reflects the requirements of the system. I implemented a system of dynamic targets based on user requirements and memory pressure for tuning per-CPU cache sizes in the FreeBSD kernel memory allocator. It ended up being extremely effective at adjusting to varying system loads.

With regard to the signal delivery: in most kernels there is already a mechanism to temporarily raise the priority of a process so that it may receive a critical signal. Otherwise you wouldn't be able to kill something that had a very low priority. If the application is catching the DANGER signal, you could raise its priority for a few time slices so that it has a chance to clean up some resources.

I can't speak for the AIX implementation, but I do not feel that this concept is a hack. Many userland malloc implementations are capable of returning memory to the system now. With a system like SIGDANGER they could tune their cache sizes according to system memory pressure. This is exactly the problem we're trying to address here.

If you simply remove the pages, how do you handle user faults for those addresses? I can imagine a system where the user catches SIGSEGV and has a jmp_buf to return to on fault. This sounds like it could be quite complicated and error prone though. How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone? Keep in mind that the user could be executing any arbitrary code at the time that you decide to discard pages.
From: taral
2003-05-28 05:56 pm (UTC)
How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone?

Since the signal interrupts any currently in-progress operations, the application should not access the pages before receiving the update. Also, any pages that are "in-use" should be marked thus by the use of mlock() or madvise() to prevent their removal.
From: jeffr
2003-05-28 06:05 pm (UTC)
That adds two system calls for every access to a cached object. I guess the application could be written with a two-tiered cache: the first tier would always be locked, and the second would have to be migrated to the first and locked before use.

Even so, this seems like a very imposing interface. You have to page-align all objects, and every object must be exactly a page in size; otherwise you pin down unrelated objects.

It still seems to me that the signal is more flexible and less intrusive. It's also less code.

It's your deal though, have fun with it.
From: taral
2003-05-28 10:07 pm (UTC)
Oh, sure. The signal is more flexible and less intrusive. It's also less effective. I'll do a few experiments on my end and tell you what I end up with.
From: jeffr
2003-05-29 11:16 am (UTC)

I do like to beat a dead horse.

By effectiveness I assume you mean the amount of memory reclaimed.

In the signal based solution the memory reclaimed is limited by the application because it has to voluntarily give it up.

In the other solution the memory reclaimed is limited by the application because it has to mark it as reclaimable and unlock it.

The second solution also seems more costly in the kernel. You have to dig through your VM data structures to find something suitable to remove and then make sure you can remove it. So you also maintain the idea of pages that can be freed in two places instead of one.

This also restricts you to a coarse-grained LRU page replacement policy. If the page reclaiming is done in the application, it can exploit application-specific knowledge to build a free list that is smarter than LRU. It is coarse-grained because most VMs don't reset the accessed bits on the page with any great frequency, while in the application perfect LRU order could be maintained by simply using a tail queue.

Anyway, I'm sure both solutions will work well enough. I've already bikeshedded this discussion enough. I'll try to find some other outlet for my post-coffee programming discussion frenzy.
From: taral
2003-05-29 03:14 pm (UTC)

So do I.

Actually, effectiveness is "probability of avoiding swapping out data".

Yes, bikeshed indeed...
From: taral
2003-05-28 06:03 pm (UTC)
Grr. It ate my comment.

How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone?

By a process of pinning. The user uses mlock() or madvise() to pin the pages it needs to work on. When it's done, it unpins them. Pinned pages are not considered for reclaim by this system.