2003-05-27 01:26 am (UTC)
Good link to DLmalloc. Didn't realize that was glibc's malloc.
BTW, Judy isn't our trouble.
We thought our trouble was having a few GB of random small-to-large allocations. But actually it's doing fine. It's growing a little over time and could use some work (maybe rounding all allocations to some multiple?), but it's not bad.
I think our main problem earlier was Linux not liking to mlockall() a huge process. Something made the machine die, and nothing got written to logs.
Yes, mlockall() is your problem. The 2.5 kernel is _far_ better about this scenario.
What you really want is something I term "transient memory": memory that the kernel knows it can reclaim in the event of a low-real-memory condition. I'm sure you wouldn't care about it stealing pages away so long as it tells you about it.
But there's nothing like that in the linux kernel. I'll see if I can't hack something up.
2003-05-27 08:43 am (UTC)
But how would we adjust data structures after it reclaims a page?
Locate and kill. Probably with a reverse index of some kind.
2003-05-27 09:06 am (UTC)
Not concerned about the index. I'm wondering how the kernel would notify us that it had unmapped one of our pages.
Signals. Since the transient memory area is well defined (mmap flag?), there's no possibility of you needing the unmapped pages to execute the signal handler.
Some operating systems generate a signal that applications may trap before the OS goes about killing them to free up memory. I think doing this cooperatively is better than just throwing away user pages and then telling the application about it; i.e., have the OS ask first. This is common practice on operating systems with VM overcommit.
Your real problem here is that you don't want this memory swapped, right? And it's anonymous, yes? You want to lock it into memory, but it isn't really important data. It sounds like this problem would be better modeled by a new pager type.
It would be neat to have a user/kernel cooperative pager type for caches like this. In BSD, all memory is associated with a pager. Pagers implement the backing store for memory objects: they write out dirty pages when space needs to be reclaimed, and page in pages that are required. If the kernel could call back into a user-defined pager for this memory, it could throw pages away at will and map them back in when you try to access that address.
That would be rad eh?
That doesn't work. You can easily create a priority inversion that way. That's why I have the kernel throwing away the LRU pages in the transient mapped region and then telling you about it after the fact.
Priority inversion between a low priority thread with a lot of memory and a high priority thread that wants memory?
Perhaps I didn't explain the interface in much detail.
Most VMs have a system of page free targets. The pageout daemons try to keep the page lists at these targets. Very few actually free memory right when they need it. That ends up leading to all sorts of issues other than priority inversion.
In any case, the signal is not a last resort. It's used when memory pressure is high but not quite fatal. You nicely ask processes to give up some memory. If that fails, you kill them and take it.
This application is a prime example of where that might be helpful. If you have a siginfo capable operating system you could even deliver information about how many pages to free, etc.
What happens if the application is following a pointer and you free it? How do you handle the segfault when you've thrown the data away? Does the app have a segfault handler that backs off on the cache? This seems problematic.
I think the SIGDANGER (AIX terminology) technique is more sound. You should look it up.
I work on AIX. Let me tell you now, SIGDANGER is a hack. Nothing more. There are plenty of cases where a process never gets to handle SIGDANGER because the SIGKILL has already been posted to it.
For a daemon like memcached, we'd probably have to start paging before memcached got a chance to run -- that's bad. I prefer the kernel to start dumping pages in LRU order as a solution, since it allows the kernel to dump pages when it needs them without having to wait for a user-mode process to dump them.
It's also preferable because the high-water-mark-based system will sometimes cause the system to dump memory unnecessarily.
What's wrong with this in your opinion?
Watermark systems are just fine when they are dynamic. I am often opposed to fixed watermarks or targets where a reasonable job could be done of estimating the real requirements from past behavior. As long as your algorithm is adaptive, you will eventually reach a steady state that accurately reflects the requirements of the system. I implemented a system of dynamic targets based on user requirements and memory pressure for tuning per-CPU cache sizes in the FreeBSD kernel memory allocator. It ended up being extremely effective at adjusting to varying system loads.
Regarding the signal delivery: in most kernels there is already a mechanism to temporarily raise the priority of a process so that it may receive a critical signal. Otherwise you wouldn't be able to kill something that had a very low priority. If the application is catching the DANGER signal, you could raise its priority for a few time slices so that it has a chance to clean up some resources.
I can't speak for the AIX implementation, but I do not feel that this concept is a hack. Many userland malloc implementations are capable of returning memory to the system now. With a system like SIGDANGER they could tune their cache sizes according to system memory pressure, which is exactly the problem we're trying to address here.
If you simply remove the pages how do you handle user faults for those addresses? I can imagine a system where the user catches SIGSEGV and has a jmpbuf to return to on fault. This sounds like it could be quite complicated and error prone though. How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone? Keep in mind that the user could be executing any arbitrary code at the time that you decide to discard pages.
How do you intend to handle the situation where the user accesses a page that you have freed before you have notified them that the data is gone?
Since the signal interrupts any in-progress operations, the application should not access the pages before receiving the update. Also, any pages that are "in-use" should be marked as such with mlock() or madvise() to prevent their removal.
DLmalloc is interesting. It looks like a heavily tuned buddy allocator.
As raja pointed out, a slab allocator would probably be better for your purposes. Although it sounds like fragmentation isn't the issue so I'll move along now. ;-)