May 1st, 2004



Happy May!

Nothing much interesting to report besides that. Oh, I'll try:

Dina's working all weekend, including tomorrow pretty early, so the Saturday morning ritual-breakfast at Reedville Cafe is in jeopardy.... skip it, or wake up really early? Tough.

Not sure what I'm doing all weekend. Rearranging my home office and garage intermixed with some Perlbal / MogileFS, maybe? And some mind-numbing TV? Sounds like a plan. Might go to my parents' beach place for a bit, where they're finally done remodeling apparently. And some biking trails around there?

I'm eagerly waiting for device-mapper in Linux 2.6 to stabilize.

Don't feel like sleeping.

MySQL data w/ MogileFS-like distribution

There are two things that suck about LJ's architecture nowadays:

-- The obvious one: the global master is still a single point of failure. Since we have hardly any data on it, it's a perfect candidate for moving onto MySQL Cluster once it matures, so this isn't a big concern of mine. Also, we can switch global masters manually pretty easily.

-- Second problem, more of a cost/maintenance issue: we have 2 databases for each cluster, only one of which is ever in use. So for N clusters, we have N active machines and N spares. (2N machines!) I'd rather have 1 or 2 spares total and have N+{1,2} machines.

Related to the second problem is that we have ~10 databases per machine (each like 8-20GB, iirc). Breaking it down like this makes it easy to move 1/10th of the load/disk between machines. The problem is that it's not granular enough.

Ideally, each user would have his/her own database which is maintained on 2+ random machines, à la MogileFS. Then if we lose a node, application connections just connect to the other one.

Rebalancing load/disk would be quick/easy... simple filesystem copies between machines. Or just changing the weighting of hosts that the application picks (because a user is always on two machines).
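The weighted host-picking could look something like this sketch. The hostnames, weights, and the `REPLICAS` table are all made up for illustration; this isn't any real MogileFS API.

```python
import random

# Hypothetical mapping: each user's database lives on two hosts, each
# with a weight the application can tune to shift load between copies.
REPLICAS = {
    "user_bob": [("db3.example.com", 0.7), ("db7.example.com", 0.3)],
}

def pick_host(user):
    """Pick one of the user's replica hosts, biased by its weight."""
    hosts = REPLICAS[user]
    total = sum(w for _, w in hosts)
    r = random.uniform(0, total)
    for host, weight in hosts:
        r -= weight
        if r <= 0:
            return host
    return hosts[-1][0]  # guard against float rounding
```

Draining a machine then just means dropping its weight toward zero before copying the data off.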

Now, reasons why this is a little tricky:

-- currently, data never changes in MogileFS. so, to make this work with databases, the servers tracking what's where need to know which copy received the last writes, how much, and where all the other guys are at as far as replication. when the application needs a handle, it must not write to an out-of-date copy.
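A minimal sketch of that tracking, with made-up hostnames and integer positions standing in for real replication coordinates: writes may only go to a copy that has everything.

```python
# Hypothetical tracker state: for each user database, the binlog
# position each copy has reached. The copy with the highest position
# received the last writes; anything behind it is stale.
positions = {
    "user_bob": {
        "db3.example.com": 1540,  # received the last writes
        "db7.example.com": 1484,  # still catching up
    },
}

def writable_copies(user):
    """Only fully caught-up copies may accept new writes."""
    copies = positions[user]
    newest = max(copies.values())
    return [host for host, pos in copies.items() if pos == newest]
```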

-- mysql replication is too limited for this. you can't set up arbitrary replication links on the fly between arbitrary databases. you can also only have one binlog per machine. I'd want one per database so I could purge it when the other hosts caught up.

-- directory layout... most filesystems can't deal with millions of files/directories in the same directory. there's been improvement lately, but it's still pretty sad. mysql would need support for specifying the path to a database in its database table so the directories can all be hashed out.
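The hashed layout is the same trick MogileFS uses for files: fan the names out over hash-derived subdirectories. A sketch (md5 is just an example hash; the two-level fan-out is an assumption):

```python
import hashlib

def db_path(dbname, levels=2):
    """Hash a database name into nested directories so no single
    directory ever holds millions of entries."""
    h = hashlib.md5(dbname.encode()).hexdigest()
    parts = [h[i * 2:i * 2 + 2] for i in range(levels)]
    return "/".join(parts + [dbname])
```

With two levels of 256 buckets each, 3M databases works out to roughly 45 per leaf directory.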

-- 3M+ databases instead of ~90 means schema upgrades shouldn't happen during application upgrades, but rather on the fly when the application requests a database handle, since the individual databases will be so small.
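Upgrade-on-connect could be sketched like this. The migration statements and the `run_sql` stand-in are invented for illustration; a real version would issue them against the user's tiny database before handing back the connection.

```python
# Hypothetical ordered migrations; each per-user database records which
# version it has reached.
MIGRATIONS = {
    1: "ALTER TABLE entries ADD COLUMN security INT",
    2: "CREATE INDEX idx_time ON entries (logtime)",
}
LATEST = max(MIGRATIONS)

def run_sql(db, sql):
    # Stand-in for executing SQL against the user's database.
    db.setdefault("applied", []).append(sql)

def get_handle(db):
    """Apply any pending migrations the first time a handle is requested."""
    while db["version"] < LATEST:
        nxt = db["version"] + 1
        run_sql(db, MIGRATIONS[nxt])
        db["version"] = nxt
    return db
```

Since each database is tiny, each individual upgrade is fast; the cost gets smeared across normal traffic instead of one giant ALTER.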

In the end I envision:

-- just buying 1 more machine (not 2) whenever load gets high or disk gets full.

-- a higher percentage of our total DB machines doing useful work (like 100% instead of the current ~95%), but when one dies, the rest are just strained a bit (but evenly)

-- no master/slave relation between any machines.

-- binlogs per database

-- any machine can go down and the application will just use another copy of that user's database until the others come back up and catch up.

-- if a second copy is lost, it can be restored by copying the other copy's data after locking access to the database and flushing tables.

-- automatic load-balancing and disk distribution

-- moving users by copying their data files (rsync -aW) instead of sql.

-- easy per-user backups

-- quick schema upgrades

-- a mysqld proxy that looks like mysql, but really just takes your connection requests, looks at the database, and does all the above magic.

It all seems rather attainable. The MySQL directory hashing thing is easy and we could just sponsor them to do it. The MySQL binlog thing could even be worked around: another task parsing the global binlog and splitting it up per-user.
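That splitter task could look like this sketch, assuming events have already been parsed down to (database, statement) pairs. The event shape is made up; a real version would chew on mysqlbinlog output and append to per-user binlog files on disk.

```python
from collections import defaultdict

def split_binlog(events):
    """Route each (database, statement) event from the single global
    binlog into a per-user stream, so each user's log can be purged
    independently once that user's other copies have caught up."""
    per_db = defaultdict(list)
    for dbname, statement in events:
        per_db[dbname].append(statement)
    return per_db
```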