MogileFS: realtime device balancing - brad's life - LiveJournal
Brad Fitzpatrick


MogileFS: realtime device balancing [Nov. 22nd, 2005|05:06 pm]

One of the more underplanned aspects of the LiveJournal move was not having enough disks for MogileFS data. We had enough disk space, but not enough spindles.

To make matters worse, when we'd replicated the data over the net to the new data center, it was unevenly loaded across the disks we did have. And after I did my 90-100 mph trip down to Fry's the other day to get 10 new 300 GB drives to add more spindles, that just made it even more uneven.

So now MogileFS was dealing with a problem we'd never had before... some disks were busy as fuck, and some were pretty idle. When a client asked for the location of a file, Mogile typically just randomized the list of devices holding it, but we needed it to be smarter.

First step: a weight column in the device table (shown below). Now MogileFS would return a weighted random list based on that weight, so higher-weight devices tend to come first.
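
To make "weighted random list" concrete, here's a rough sketch in Python (MogileFS itself is Perl, and this is not its actual code). It orders devices by repeatedly drawing one at random in proportion to its weight, without replacement. The function name `weighted_order` and the uniform fallback for all-zero weights are assumptions made up for the illustration; the sample weights are taken from the table in this post.

```python
import random

def weighted_order(weights, rng=random):
    """Return device ids in weighted random order.

    `weights` maps devid -> weight. Devices are drawn one at a time,
    with probability proportional to weight, without replacement.
    """
    remaining = dict(weights)
    order = []
    while remaining:
        ids = list(remaining)
        w = [max(remaining[d], 0) for d in ids]
        if sum(w) <= 0:
            # Assumption: if every remaining weight is zero, fall back
            # to a plain shuffle of what is left.
            rng.shuffle(ids)
            order.extend(ids)
            break
        pick = rng.choices(ids, weights=w, k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order

# A file stored on devids 1, 9 and 17 (weights 22, 84 and 101 in the table).
print(weighted_order({1: 22, 9: 84, 17: 101}))
```

With those weights the busy device 1 should come out first only about one time in ten (22 out of 207), while it still stays in the list as a fallback.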

Second step: hooking up "iostat -x" to xinetd and writing a daemon to read all Mogile hosts' IO activity and dynamically adjust the weights in realtime.
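
The post doesn't give the daemon's formula, so the following is only a guess at the general shape, in Python. `parse_util` pulls the `%util` column out of `iostat -x` style text (it finds the column by its header name, since the layout differs between iostat versions). `next_weight` is a made-up rule: aim for a weight of roughly 100 minus utilization, and move halfway there on each pass so the weights don't oscillate. Both function names, the target rule and the `smoothing` parameter are assumptions, not what MogileFS actually does.

```python
def parse_util(iostat_text):
    """Map device name -> %util from `iostat -x` style text."""
    util = {}
    col = None  # index of the %util column in the current device table
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            col = None  # a blank line ends a device table
            continue
        if fields[0].startswith("Device"):
            col = fields.index("%util") if "%util" in fields else None
            continue
        if col is not None and len(fields) > col:
            try:
                util[fields[0]] = float(fields[col])
            except ValueError:
                pass  # skip rows that are not numeric
    return util

def next_weight(old_weight, util_pct, smoothing=0.5):
    """Hypothetical rule: idle disks drift toward 100, busy ones toward 1."""
    target = max(1.0, 100.0 - util_pct)
    return max(1, round(old_weight + smoothing * (target - old_weight)))

# Shortened, invented sample; real iostat -x output has more columns.
sample = """Device:  r/s    w/s   await  %util
sda      85.00  3.00  21.50  90.00
sdb       2.00  0.50   4.10   5.00
"""
for dev, pct in parse_util(sample).items():
    print(dev, next_weight(50, pct))
```

A real daemon would fetch that text from each host over the xinetd port, map disk names to devids, and write the new weights back to the device table on a loop.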

And voilà.... the weights have stabilized at seemingly-optimal values:
+-------+--------+--------+--------+----------+---------+------------+
| devid | hostid | status | weight | mb_total | mb_used | mb_asof    |
+-------+--------+--------+--------+----------+---------+------------+
|     1 |      1 | alive  |     22 |   469336 |  232349 | 1132707629 |
|     2 |      1 | alive  |     23 |   469336 |  232558 | 1132707629 |
|     3 |      2 | alive  |     30 |   469452 |  241594 | 1132707629 |
|     4 |      3 | alive  |     20 |   469452 |  208487 | 1132707629 |
|     5 |      4 | alive  |     24 |   469452 |  243776 | 1132707629 |
|     6 |      5 | alive  |     27 |   469336 |  178196 | 1132707629 |
|     7 |      5 | alive  |     27 |   469336 |  178440 | 1132707629 |
|     8 |      6 | alive  |     23 |   469452 |  145868 | 1132707629 |
|     9 |      7 | alive  |     84 |   187785 |   13926 | 1132707629 |
|    10 |      7 | alive  |     83 |   187785 |   13931 | 1132707629 |
|    11 |      7 | alive  |     82 |   187785 |   15324 | 1132707629 |
|    12 |      7 | alive  |     86 |   187785 |   20965 | 1132707629 |
|    13 |      8 | alive  |     56 |   281675 |   68977 | 1132707629 |
|    14 |     13 | alive  |     76 |   281675 |   65579 | 1132707629 |
|    15 |     12 | alive  |     69 |   281675 |   60245 | 1132707629 |
|    16 |     11 | alive  |     63 |   281675 |   55949 | 1132707629 |
|    17 |     10 | alive  |    101 |   281675 |     104 | 1132707629 |
|    18 |      9 | alive  |     58 |   281675 |   77835 | 1132707629 |
|    19 |     14 | alive  |     68 |   281675 |   78330 | 1132707629 |
|    20 |     15 | alive  |    101 |   281675 |      95 | 1132707629 |
|    21 |     16 | alive  |    101 |   281675 |     867 | 1132707629 |
|    22 |     17 | alive  |    101 |   281675 |     849 | 1132707629 |
+-------+--------+--------+--------+----------+---------+------------+
And things are better than they have been in the past several days wrt MogileFS.

We still have a shitload of machines to rack/cable and get pooled, including our old (much bigger) MogileFS farm, which'll help a lot. But now, for the future, we have MogileFS weighting... so that's cool.

Comments:
From: dossy
2005-11-23 03:27 am (UTC)

automatic cascading redistribution?

Brad, my understanding of MogileFS is that it's essentially a "network RAID 1" type data distribution mechanism, right?

If some spindles are too busy while others are nearly idle, couldn't you track what assets are being accessed and replicate that data onto the spindles that are less busy? Or, is that what your weighting is doing? Weighting sounds like a way to guide new read/write operations, but doesn't help with old data, right? MogileFS is WORM, right? So, wouldn't "intelligently" redistributing previously written data have a measurable impact on performance, too?