January 25th, 2004



papag:parrot $ make
Compiling with:
cc -D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -g -Dan_Sugalski -Larry -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Waggregate-return -Winline -W -Wno-unused -Wsign-compare -Wformat-nonliteral -Wformat-security -Wpacked -Wdisabled-optimization -mno-accumulate-outgoing-args -Wno-shadow -I./include -DHAS_JIT -DI386 -DHAVE_COMPUTED_GOTO xx.o -c xx.c


InnoDB and /proc/mm

On 32 bit processors, applications can only address 4GB of memory. This includes applications like MySQL, which is unfortunate when you have machines with 6GB, 8GB, and 12GB of memory (as LiveJournal does).
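To put rough numbers on that limit, here's a quick back-of-the-envelope sketch. The machine sizes are the ones mentioned above (treated as GiB for simplicity); the function name is made up for illustration.

```python
GIB = 2**30

def addressable_bytes(pointer_bits):
    """Bytes that a flat address space with this pointer width can cover."""
    return 2**pointer_bits

limit = addressable_bytes(32)
print(f"32-bit address space: {limit // GIB} GiB")

# How much RAM sits beyond what any single process can address at once.
for ram_gib in (6, 8, 12):
    out_of_reach = ram_gib * GIB - limit
    print(f"{ram_gib} GB box: {out_of_reach // GIB} GiB out of reach of one process")
```

And that's the best case: as it turns out, a process doesn't even get the whole 4GB to itself.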

IA64/Opteron/G5s... yeah, yeah... those are 64 bit, and I'm sure we'll buy some IA64 machines in the future. But we have a metric shitload of machines that are only 32 bit and can take more memory, so that's always the cheapest bang for our buck when it's upgrade time.

Normally this isn't a problem with MySQL if you're using the MyISAM table engine, since MyISAM only asks MySQL (the application) to cache its indexes in main memory. On, say, a 4GB machine you give 750MB-1.5GB to MySQL (or whatever keeps index hit rates high before major diminishing returns) and the rest of the "unused" memory is used by the operating system's buffer cache.

Now, the problem comes when you want to use InnoDB with >4GB machines. InnoDB does its own buffer management (which is more precise than the kernel's, since it knows more about what's going on, its access patterns, etc.). So because InnoDB wants MySQL to cache both index and data pages in-process, you're effectively limited to about 2GB for all of it. Userspace only gets ~3GB of usable space anyway, and then you have all of the MySQL thread stacks too, which can't overlap the heap.
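Here's a rough sketch of how that ~3GB gets eaten. Every number below (thread count, stack size, the "everything else" figure) is a hypothetical I picked for illustration, not a measurement from a real server, and real address spaces also lose room to fragmentation and shared library mappings.

```python
MIB = 2**20
GIB = 2**30

def max_buffer_pool(user_space, threads, stack_per_thread, other_overhead):
    """Rough ceiling on an in-process cache after stacks and other allocations."""
    remaining = user_space - threads * stack_per_thread - other_overhead
    return max(remaining, 0)

# All values are assumptions for the sake of the arithmetic.
budget = max_buffer_pool(
    user_space=3 * GIB,          # ~3GB of usable userspace on 32-bit Linux
    threads=500,                 # hypothetical connection count
    stack_per_thread=1 * MIB,    # hypothetical per-thread stack reservation
    other_overhead=512 * MIB,    # hypothetical: code, libraries, other buffers
)
print(f"room left for the buffer pool: {budget / GIB:.2f} GiB")
```

With those made-up inputs you land right around 2GB, which is about where things top out in practice.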

InnoDB gets around this problem on Windows by using the Windows AWE API (Address Windowing Extensions), which lets Windows apps map windows of extra physical memory in and out of their address space on PAE-capable processors (Physical Address Extension), which allow 36-bit physical addressing. This is only available in the super-pricey Windows Data Center Server or something, though.
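The general windowing idea is easy to sketch, even if the real thing isn't. The toy below is only an analogy: it is neither AWE nor /proc/mm, it just uses an ordinary file-backed mmap so that a big backing store is reached through one small mapped window that gets remapped on demand. The class and its method names are made up for illustration.

```python
import mmap
import tempfile

WINDOW = mmap.ALLOCATIONGRANULARITY   # size of the single window kept mapped
TOTAL = WINDOW * 8                    # backing store is 8x bigger than the window

class WindowedStore:
    """Reach a large backing file through one small mapped window."""

    def __init__(self, backing_file):
        self._file = backing_file
        self._window = None
        self._index = None
        self.remaps = 0               # how many times the window was moved

    def _map(self, index):
        # Move the window only when the requested region isn't mapped already.
        if index != self._index:
            if self._window is not None:
                self._window.close()
            self._window = mmap.mmap(
                self._file.fileno(), WINDOW, offset=index * WINDOW
            )
            self._index = index
            self.remaps += 1
        return self._window

    def write_byte(self, position, value):
        self._map(position // WINDOW)[position % WINDOW] = value

    def read_byte(self, position):
        return self._map(position // WINDOW)[position % WINDOW]

    def close(self):
        if self._window is not None:
            self._window.close()
            self._window = None
            self._index = None

with tempfile.TemporaryFile() as backing:
    backing.truncate(TOTAL)           # grow the file to the full store size
    store = WindowedStore(backing)
    store.write_byte(0, 11)           # lands in the first window
    store.write_byte(TOTAL - 1, 22)   # forces a remap to the last window
    first, last = store.read_byte(0), store.read_byte(TOTAL - 1)
    remaps = store.remaps
    store.close()

print(first, last, remaps)
```

The cost shows up in the remap counter: every access outside the current window pays for a remap, which is why a buffer manager that knows its own access patterns is the right place to make those decisions.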

But what about Linux?

I sent MySQL support a request asking if they could make InnoDB support the non-standard /proc/mm interface on Linux (which User Mode Linux uses in SKAS mode to create arbitrary address spaces), the way it already supports the AWE API on Windows.

I have no time to research how hard this would be, but it seems like the tricky stuff is already done in InnoDB if it supports AWE, and adding the Linux-equivalent calls in the right places shouldn't be difficult.

Of course, InnoDB should still work fine on high-memory machines, but it'll just end up falling back to the operating system's buffer cache most of the time, which is less efficient and makes InnoDB's buffer pool cache hit rate stats inaccurate.
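To show what I mean by inaccurate stats, here's a toy calculation. The counters are invented, and the split between "InnoDB missed" and "the OS had it in RAM anyway" is exactly the part InnoDB can't see from the inside.

```python
def hit_rate(requests, misses):
    """Fraction of requests that did not count as a miss."""
    return 1 - misses / requests if requests else 0.0

# Hypothetical counters: a small buffer pool sitting on top of a big OS cache.
logical_reads = 1_000_000
pool_misses = 400_000      # pages InnoDB had to ask the OS for
os_cache_hits = 380_000    # of those, pages the OS served straight from RAM

reported = hit_rate(logical_reads, pool_misses)
effective = hit_rate(logical_reads, pool_misses - os_cache_hits)

print(f"hit rate InnoDB would report: {reported:.1%}")
print(f"reads actually served from RAM: {effective:.1%}")
```

So the buffer pool numbers can look terrible while the disks are barely being touched, and you can't tell the difference from InnoDB's own stats.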

Anybody that knows more about this: am I on crack, or does /proc/mm seem like it'd work?

Anybody need a contract project, if the InnoDB people aren't interested?