MogileFS: realtime device balancing - brad's life
Brad Fitzpatrick

MogileFS: realtime device balancing [Nov. 22nd, 2005|05:06 pm]

One of the more underplanned aspects of the LiveJournal move was not having enough disks for MogileFS data. We had enough disk space, but not enough spindles.

To make matters worse, when we replicated the data over the net to the new data center, it ended up unevenly loaded across the disks we did have. And after I did my 90-100 mph trip down to Fry's the other day to get 10 new 300 GB drives to add more spindles, that just made things even more uneven.

So now MogileFS was dealing with a problem we'd never had before... some disks were busy as fuck, and some were pretty idle. When a client asked for the location of a file, Mogile typically just randomized the list, but we needed it smarter.

First step: a weight column in the device table (shown below). Now MogileFS would do a weighted random list on that weight.
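The post doesn't say how Mogile builds its weighted random list, so the following is only a rough Python sketch of one common way to do it: repeatedly draw one device with probability proportional to its weight, then remove it from the pool. The function name `weighted_order` and the dict layout are made up for illustration; only the weights come from the device table.

```python
import random


def weighted_order(devices, rng=random):
    """Return device ids in a weighted-random order.

    `devices` maps a device id to a positive weight. Devices with a
    higher weight tend to land earlier in the list, but any order is
    possible. (Hypothetical helper, not MogileFS's actual code.)
    """
    remaining = dict(devices)
    ordered = []
    while remaining:
        ids = list(remaining)
        weights = [remaining[d] for d in ids]
        # Draw one device, with probability proportional to its weight.
        pick = rng.choices(ids, weights=weights, k=1)[0]
        ordered.append(pick)
        del remaining[pick]
    return ordered


# Suppose a file has copies on devid 1 (weight 22) and devid 9 (weight 84).
print(weighted_order({1: 22, 9: 84}))
```

Under this sketch, devid 9 would be listed first with probability 84 / (22 + 84), or about 79% of the time, so the idle disk soaks up most of the reads while the busy one still gets some.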

Second step: hooking up "iostat -x" to xinetd and writing a daemon that reads every mogile host's IO activity and dynamically adjusts the weights in realtime.
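For the curious, here's a rough Python sketch of what such a daemon's inner loop might look like. The post doesn't give the port, the metric, or the adjustment rule, so everything here is an assumption: the `%util` column as the busyness signal, the `target`, `step`, `lo` and `hi` parameters, and all three function names. The sample report is made-up text in the general shape of `iostat -x` output, not real data.

```python
import socket


def fetch_report(host, port, timeout=5.0):
    """Read whatever the remote service sends until it closes the socket.

    The port is hypothetical; the post doesn't say which one xinetd used.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def parse_util(iostat_text):
    """Map device name -> %util from `iostat -x` style text.

    Column positions differ between sysstat versions, so the %util
    column is located from the header row instead of being hard-coded.
    """
    util = {}
    col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            col = None  # a blank line ends the device table
            continue
        if fields[0].startswith("Device"):
            col = fields.index("%util") if "%util" in fields else None
            continue
        if col is not None and len(fields) > col:
            util[fields[0]] = float(fields[col])
    return util


def adjust_weight(weight, util, target=50.0, step=1, lo=1, hi=101):
    """Nudge a weight down if the disk is busier than `target`, up if idler.

    All parameters are guesses; `hi=101` is just the largest weight
    that appears in the post's table.
    """
    if util > target:
        weight -= step
    elif util < target:
        weight += step
    return max(lo, min(hi, weight))


# Made-up sample report, for illustration only.
SAMPLE = """\
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.10   2.00 80.00  5.00 1200.00   60.00    14.82     1.90   22.00   9.50  92.00
sdb          0.00   0.50  3.00  1.00   40.00   12.00    13.00     0.02    4.00   2.00   6.00
"""

for dev, util in parse_util(SAMPLE).items():
    print(dev, util, adjust_weight(50, util))
```

Moving the weight by a small step on each poll, rather than jumping straight to a computed value, is one way to get the kind of gradual stabilizing described here, though the real daemon may well do something different.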

And voilà.... the weights have stabilized at seemingly optimal values:
+-------+--------+--------+--------+----------+---------+------------+
| devid | hostid | status | weight | mb_total | mb_used | mb_asof    |
+-------+--------+--------+--------+----------+---------+------------+
|     1 |      1 | alive  |     22 |   469336 |  232349 | 1132707629 |
|     2 |      1 | alive  |     23 |   469336 |  232558 | 1132707629 |
|     3 |      2 | alive  |     30 |   469452 |  241594 | 1132707629 |
|     4 |      3 | alive  |     20 |   469452 |  208487 | 1132707629 |
|     5 |      4 | alive  |     24 |   469452 |  243776 | 1132707629 |
|     6 |      5 | alive  |     27 |   469336 |  178196 | 1132707629 |
|     7 |      5 | alive  |     27 |   469336 |  178440 | 1132707629 |
|     8 |      6 | alive  |     23 |   469452 |  145868 | 1132707629 |
|     9 |      7 | alive  |     84 |   187785 |   13926 | 1132707629 |
|    10 |      7 | alive  |     83 |   187785 |   13931 | 1132707629 |
|    11 |      7 | alive  |     82 |   187785 |   15324 | 1132707629 |
|    12 |      7 | alive  |     86 |   187785 |   20965 | 1132707629 |
|    13 |      8 | alive  |     56 |   281675 |   68977 | 1132707629 |
|    14 |     13 | alive  |     76 |   281675 |   65579 | 1132707629 |
|    15 |     12 | alive  |     69 |   281675 |   60245 | 1132707629 |
|    16 |     11 | alive  |     63 |   281675 |   55949 | 1132707629 |
|    17 |     10 | alive  |    101 |   281675 |     104 | 1132707629 |
|    18 |      9 | alive  |     58 |   281675 |   77835 | 1132707629 |
|    19 |     14 | alive  |     68 |   281675 |   78330 | 1132707629 |
|    20 |     15 | alive  |    101 |   281675 |      95 | 1132707629 |
|    21 |     16 | alive  |    101 |   281675 |     867 | 1132707629 |
|    22 |     17 | alive  |    101 |   281675 |     849 | 1132707629 |
+-------+--------+--------+--------+----------+---------+------------+
And things are better than they have been in the past several days wrt MogileFS.

We still have a shitload of machines to rack/cable and get pooled, including our old (much bigger) MogileFS farm, which'll help a lot. But now, for the future, we have MogileFS weighting... so that's cool.

Comments:
From: xb95
2005-11-23 01:13 am (UTC)
I'm still shaking my head... but, whatever works, I guess!

From: crw
2005-11-23 01:15 am (UTC)
a routing protocol for disks. nice.

From: adamthebastard
2005-11-23 03:00 am (UTC)
I agree. To expand on that point: the previous disk selection method was similar to a distance-vector routing protocol (with all distances set to an equal value), whereas the weighted random method is more like a link-state routing protocol.

This is a very cool feature and perhaps the next step is to include network latency/bandwidth (link-state ?) in the weight generating algorithm for occasions when not all networking hardware is performing at the same speed.

link-state on wikipedia
Distance-vector on wikipedia

From: whitaker
2005-11-23 01:23 am (UTC)
Awesomeo.

From: dakus
2005-11-23 01:37 am (UTC)
too bad Mr. MogileFS can't see that little blinky blinky lite on hard disks showing how busy they are...then he'd know what to do!

they still have them lights don't they?

From: edm
2005-11-23 03:08 am (UTC)
Umm, that's pretty much what brad has done. "iostat -x" reports disk accesses (over some recent period of time), and he made that available over the network to MogileFS. Now it can see the "virtual blinking lights".

It's a very cool hack.

Ewen

(Deleted comment)

From: dakus
2005-11-23 04:48 am (UTC)
now I feel all stigmatized!!

From: xaosenkosmos
2005-11-23 05:23 am (UTC)

Obligatory pun

I feel all stigmatized

But edm was the one who had trouble seeing through your comment...

From: dakus
2005-11-23 04:36 am (UTC)
well that's genius then!

From: jamesd
2005-11-23 02:00 am (UTC)
Nice feature.

From: dossy
2005-11-23 03:27 am (UTC)

automatic cascading redistribution?

Brad, my understanding of MogileFS is that it's essentially a "network RAID 1" type data distribution mechanism, right?

If some spindles are too busy while others are nearly idle, couldn't you track what assets are being accessed and replicate that data onto the spindles that are less busy? Or, is that what your weighting is doing? Weighting sounds like a way to guide new read/write operations, but doesn't help with old data, right? MogileFS is WORM, right? So, wouldn't "intelligently" redistributing previously written data have a measurable impact on performance, too?

From: ckd
2005-11-23 03:27 am (UTC)
Profile, then optimize. I love it. Elegant!

From: d4b
2005-11-24 04:26 am (UTC)
"And after I did my 90-100 mph trip down to Fry's the other day to get 10 new 300 GB drives to add more spindles, that just made things even more uneven."

That's a funny image, you bursting into the store, out of breath, wad of cash in your fist, desperate for a three terabyte fix. (Double entendre intended.)