InnoDB and /proc/mm - brad's life - LiveJournal
Brad Fitzpatrick


InnoDB and /proc/mm [Jan. 25th, 2004|09:56 pm]
Brad Fitzpatrick

On 32-bit processors, a single application can only address 4GB of memory. This includes applications like MySQL, which is unfortunate when you have machines with 6GB, 8GB, and 12GB of memory (as LiveJournal does).

IA64/Opteron/G5s... yeah, yeah... those are 64 bit, and I'm sure we'll buy some IA64 machines in the future. But we have a metric shitload of machines that are only 32 bit and can take more memory, so that's always the cheapest bang for our buck when it's upgrade time.

Normally this isn't a problem with MySQL if you're using the MyISAM table engine, since MyISAM only asks MySQL (the application) to cache its indexes in main memory. On, say, a 4GB machine you give 750M-1.5G to MySQL (or whatever keeps index hit rates high before major diminishing returns) and the rest of the "unused" memory is used by the operating system's buffer cache.

Now, the problem comes when you want to use InnoDB on >4GB machines. InnoDB does its own buffer management (which is more precise than the kernel's, since it knows more about what's going on, its access patterns, etc.). So because InnoDB wants MySQL to cache both indexes and data pages, you're effectively limited to about 2GB for all of it. Userspace only gets ~3GB of usable address space anyway, and then you have all of the MySQL thread stacks too, which can't overlap the heap.
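To put rough numbers on that squeeze, here's a back-of-the-envelope sketch. The thread count and per-thread stack size are made-up illustrative values, not measurements from any real server, and the function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def usable_for_buffer_pool(user_space_bytes, threads, stack_bytes_per_thread):
    """Address space left over once every thread's stack has been reserved."""
    return user_space_bytes - threads * stack_bytes_per_thread


# Assumed numbers, for illustration only:
user_space = 3 * GIB   # ~3GB of userspace on a typical 32-bit Linux kernel
threads = 400          # hypothetical number of MySQL connection threads
stack = 2 * MIB        # hypothetical address space reserved per thread stack

left = usable_for_buffer_pool(user_space, threads, stack)
print(f"{left / GIB:.2f} GiB left")  # prints "2.22 GiB left"
```

And that's before counting the binary, shared libraries, and every other allocation, which is how you end up nearer 2GB in practice.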

InnoDB gets around this problem on Windows by using the Windows AWE API (Address Windowing Extensions), which lets Windows apps create/destroy/switch address spaces on PAE-capable processors (Physical Address Extension), which allow 36-bit addressing. This is only available in the super-pricey Windows Data Center Server or something, though.
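The general idea of "windowing" (a small mapped region that you re-point at different parts of a much bigger store) can be sketched roughly like this. To be clear, this is neither AWE nor /proc/mm: it's an ordinary temp file and a plain shared mmap, so everything still goes through the kernel's page cache. The WindowedStore class and the sizes are invented for illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be multiples of this
WINDOW_SIZE = 4 * PAGE             # the small region we can "see" at once
STORE_SIZE = 64 * PAGE             # stand-in for memory we can't map all at once


class WindowedStore:
    """A large backing store viewed through one small, re-mappable window."""

    def __init__(self, size, window_size):
        self.window_size = window_size
        self.fd, self.path = tempfile.mkstemp()
        os.ftruncate(self.fd, size)  # grow the file to the full store size
        self.window = None

    def map_window(self, offset):
        """Point the window at a different region of the store."""
        if self.window is not None:
            self.window.close()  # shared mapping: earlier writes stay in the file
        self.window = mmap.mmap(self.fd, self.window_size, offset=offset)

    def close(self):
        if self.window is not None:
            self.window.close()
        os.close(self.fd)
        os.unlink(self.path)


store = WindowedStore(STORE_SIZE, WINDOW_SIZE)
store.map_window(0)
store.window[0:5] = b"page0"   # write through the window at the start
store.map_window(32 * PAGE)    # slide the window far away
store.window[0:5] = b"far32"
store.map_window(0)            # slide it back
first = bytes(store.window[0:5])
store.close()
print(first)  # b'page0'
```

A real implementation would keep its own table of which buffer pool pages are currently in the window, which is presumably the part InnoDB's AWE support already handles.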

But what about Linux?

I wrote MySQL support a request asking if they could make InnoDB support the non-standard /proc/mm interface on Linux (which User Mode Linux uses for SKAS mode to create arbitrary address spaces) like the AWE API does.

I have no time to research how hard this would be, but it seems like the tricky stuff is already done in InnoDB if it supports AWE and adding the Linux-equivalent calls in the right place shouldn't be difficult.

Of course, InnoDB should still work fine with high-memory machines, but it'll just end up falling back to the operating system's buffer cache most of the time, which will be less efficient and will make InnoDB's buffer pool cache hit rate stats inaccurate.
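Here's a tiny sketch of why the stats get misleading: a read that misses InnoDB's buffer pool but is served from the OS buffer cache still counts as a miss to InnoDB, even though no disk read happened. The workload numbers are invented, and hit_rates is a hypothetical helper, not anything InnoDB exposes.

```python
def hit_rates(requests, pool_hits, os_cache_hits):
    """Return (hit rate InnoDB would report, fraction served without touching disk)."""
    reported = pool_hits / requests
    effective = (pool_hits + os_cache_hits) / requests
    return reported, effective


# Made-up workload: 1000 page requests, 700 found in the buffer pool,
# 250 of the remaining misses satisfied by the OS buffer cache.
reported, effective = hit_rates(requests=1000, pool_hits=700, os_cache_hits=250)
print(f"reported {reported:.0%}, effective {effective:.0%}")
# prints "reported 70%, effective 95%"
```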

Anybody that knows more about this: am I on crack, or does /proc/mm seem like it'd work?

Anybody need a contract project, if the InnoDB people aren't interested?

From: jeffr
2004-01-26 02:25 am (UTC)
You want remap_file_pages() I believe.
From: taral
2004-01-26 12:18 pm (UTC)
No, that's for mapped files. InnoDB isn't mapping files, AFAICT.
From: jeffr
2004-01-27 12:39 pm (UTC)
Is it just a shared memory db? If so it can use mmap and MAP_ANON or MAP_NOSYNC|MAP_PRIVATE with a real file. This way you're just temporarily mapping the page cache.

From: taral
2004-01-27 10:04 pm (UTC)
I don't see any shm or mmap calls anywhere in the innobase code...
From: jeffr
2004-01-28 01:26 am (UTC)
er, I think you're missing the point. If it's a file backed db, you could just use mmap and remap_file_pages(). If it's a memory backed db, you can still do it with an anonymous file.

I don't know anything at all about this database. I was just aware of this mechanism in linux and thought I'd suggest it.
From: taral
2004-01-28 08:42 am (UTC)
Oh, I see your point. No, that doesn't help, since InnoDB does NOT want to use the kernel LRU for its memory management, and linux doesn't have any way of controlling the kernel working-set algorithms. That's why InnoDB does all its own buffering. What Brad wants is a way for InnoDB to get its hands on > 3GB of real memory.
From: taral
2004-01-26 12:01 pm (UTC)
Mmm... contract project... so tempting.
From: znep
2004-01-27 12:33 am (UTC)
Note that you could probably get up to somewhere around the 3 gig range for the innodb cache if you do some tweaking... the two big things blocking that are how the kernel lays out the memory it has (1 gig for kernel, mmap()ed regions start at 1 gig) and the fact that, last I checked, innodb just uses malloc() and glibc doesn't allow a malloc of a region bigger than 2 gigs.

I haven't looked into what is going on in glibc, but I'm sure it can be dealt with; I'm pretty sure it is a bug and not a feature, but could be mistaken.

You can also tweak the kernel to reduce the size allocated for the kernel and lower where mmaped regions start... I have some kernels with those settings at 3.5 gigs and 128 megs respectively to let me have close to 3 gig java heaps. Well, once I got a libc that supported floating stacks so I wasn't wasting two megs of address space per thread stack.
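(Editor's sketch of the arithmetic, reading "3.5 gigs" as the top of userspace and "128 megs" as the address where mmap regions start. It ignores the stack at the top of the address space and anything else already mapped, so treat the results as upper bounds. The function name is hypothetical.)

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def mmap_region(user_top, mmap_base):
    """Contiguous address space between the mmap start address and the top of userspace."""
    return user_top - mmap_base


# Default layout described above: 1 gig for the kernel, mmap regions start at 1 gig.
default = mmap_region(user_top=3 * GIB, mmap_base=1 * GIB)

# Tweaked layout: userspace up to 3.5 gigs, mmap regions start at 128 megs.
tweaked = mmap_region(user_top=3584 * MIB, mmap_base=128 * MIB)

print(default / GIB, tweaked / GIB)  # prints "2.0 3.375"
```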

This certainly doesn't solve the problem and isn't pretty, but may be enough to be useful in some situations and might be a lot easier to do, plus avoids the performance hit of doing PAE...

I've also thought of looking into what it would take to get support for PAE into innodb on linux, but haven't really looked into it. If only doing that with a JVM were that easy, since that is where my real fun lies.

I don't have any experience actually using PAE from userland programs on linux though. I think /proc/mm is what you want, but the interface seems a bit sketchy still. Hrm... or maybe shmfs and remap_file_pages()... I think that is what oracle uses or plans/planned on using. This starts getting into hairy VM issues really quickly where figuring out the optimal way to do things can be non-trivial.
From: brad
2004-01-27 01:11 am (UTC)
InnoDB's response to my request was, "Well, 64 bit computers are becoming a commodity now, so it's not really worth it." They're probably right.

They also recommended the Ingo 4G/4G patch, which InnoDB users have had good experience with supposedly.

I remember reading about remap_file_pages() on LWN.net a while back and I just reread the description now. While I think I see how it could work, I don't think it'd fit in as well with InnoDB given how they did AWE support, but I don't really know.... I just started reading the InnoDB source the other night. (which actually really impressed me... well documented.)
From: jeffr
2004-01-27 01:17 am (UTC)
Get an opteron. I have one and another is on the way. The porting is likely to go easily for you. They are comparable to Xeon's price/performance with higher performance for most memory bound apps. I highly recommend them.