MogileFS: realtime device balancing
[Nov. 22nd, 2005|05:06 pm]
One of the more underplanned aspects of the LiveJournal move was not having enough disks for MogileFS data. We had enough disk space, but not enough spindles.
To make matters worse, when we'd replicated the data over the net to the new data center, it was unevenly loaded across the disks we did have. And after I did my 90-100 mph trip down to Fry's the other day to get 10 new 300 GB drives to add more spindles, that just made it even more uneven.
So now MogileFS was dealing with a problem we'd never had before... some disks were busy as fuck, and some were pretty idle. When a client asked for the location of a file, Mogile typically just randomized the list, but we needed it to be smarter.
First step: a weight column in the device table (shown below). Now MogileFS would do a weighted random list on that weight.
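For readers who want to see what a "weighted random list" can look like, here is a minimal Python sketch. It is not MogileFS's actual code: the function name `weighted_order` and the plain dict of devid to weight are assumptions made for illustration, and weights are assumed to be positive. It repeatedly draws one device with probability proportional to its weight, so heavily weighted (idle) devices tend to land near the front of the list without ever fully starving the others.

```python
import random


def weighted_order(devices, rng=random):
    """Return device ids in weighted-random order.

    devices: dict mapping devid -> positive weight (hypothetical layout).
    rng: the random module or a random.Random instance.
    """
    remaining = dict(devices)  # copy so the caller's dict is untouched
    ordered = []
    while remaining:
        ids = list(remaining)
        weights = [remaining[d] for d in ids]
        # Draw one device, with probability proportional to its weight.
        pick = rng.choices(ids, weights=weights, k=1)[0]
        ordered.append(pick)
        del remaining[pick]  # sample without replacement
    return ordered
```

A client handed this list would try the first device first, so a device with weight 101 gets picked first far more often than one with weight 20.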
Second step: hooking up "iostat -x" to xinetd and writing a daemon to read all mogile hosts' IO activity and dynamically adjust the weights in realtime.
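A rough Python sketch of the weight-adjusting half of that idea follows. The post does not give the actual formula, so everything here is an assumption for illustration: the helper names `parse_util` and `next_weight`, the "weight = ceiling minus %util" mapping, the smoothing factor `alpha`, and the 1..101 range. The parser finds the `%util` column by its header name rather than by position, since the column layout of `iostat -x` differs between versions.

```python
def parse_util(iostat_text):
    """Extract {device: %util} from `iostat -x`-style text.

    The %util column is located by header name. If the text holds
    several reports, the most recent value for each device wins.
    """
    util = {}
    col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            col = None  # a blank line ends the current device block
            continue
        if "%util" in fields:
            col = fields.index("%util")  # header row of a device block
            continue
        if col is not None and len(fields) > col:
            util[fields[0]] = float(fields[col])
    return util


def next_weight(current, util_pct, alpha=0.2, floor=1, ceiling=101):
    """Move a device's weight toward (ceiling - %util).

    Assumed formula, not taken from the post: busy disks drift toward
    the floor, idle disks toward the ceiling. alpha controls how fast
    the weight follows new readings, which damps oscillation.
    """
    target = ceiling - util_pct
    blended = (1 - alpha) * current + alpha * target
    return int(min(ceiling, max(floor, round(blended))))
```

A daemon along these lines would fetch each host's iostat text over the network, call `parse_util`, and write `next_weight(...)` back to the weight column for the matching device on every polling interval.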
And voilà... the weights have stabilized at seemingly-optimal values:
+-------+--------+--------+--------+----------+---------+------------+
| devid | hostid | status | weight | mb_total | mb_used | mb_asof    |
+-------+--------+--------+--------+----------+---------+------------+
|     1 |      1 | alive  |     22 |   469336 |  232349 | 1132707629 |
|     2 |      1 | alive  |     23 |   469336 |  232558 | 1132707629 |
|     3 |      2 | alive  |     30 |   469452 |  241594 | 1132707629 |
|     4 |      3 | alive  |     20 |   469452 |  208487 | 1132707629 |
|     5 |      4 | alive  |     24 |   469452 |  243776 | 1132707629 |
|     6 |      5 | alive  |     27 |   469336 |  178196 | 1132707629 |
|     7 |      5 | alive  |     27 |   469336 |  178440 | 1132707629 |
|     8 |      6 | alive  |     23 |   469452 |  145868 | 1132707629 |
|     9 |      7 | alive  |     84 |   187785 |   13926 | 1132707629 |
|    10 |      7 | alive  |     83 |   187785 |   13931 | 1132707629 |
|    11 |      7 | alive  |     82 |   187785 |   15324 | 1132707629 |
|    12 |      7 | alive  |     86 |   187785 |   20965 | 1132707629 |
|    13 |      8 | alive  |     56 |   281675 |   68977 | 1132707629 |
|    14 |     13 | alive  |     76 |   281675 |   65579 | 1132707629 |
|    15 |     12 | alive  |     69 |   281675 |   60245 | 1132707629 |
|    16 |     11 | alive  |     63 |   281675 |   55949 | 1132707629 |
|    17 |     10 | alive  |    101 |   281675 |     104 | 1132707629 |
|    18 |      9 | alive  |     58 |   281675 |   77835 | 1132707629 |
|    19 |     14 | alive  |     68 |   281675 |   78330 | 1132707629 |
|    20 |     15 | alive  |    101 |   281675 |      95 | 1132707629 |
|    21 |     16 | alive  |    101 |   281675 |     867 | 1132707629 |
|    22 |     17 | alive  |    101 |   281675 |     849 | 1132707629 |
+-------+--------+--------+--------+----------+---------+------------+
And things are better than they have been in the past several days wrt MogileFS.
We still have a shitload of machines to rack/cable and get pooled, including our old (much bigger) MogileFS farm, which'll help a lot. But now, for the future, we have MogileFS weighting... so that's cool.
Comments:
From: xb95 2005-11-23 01:13 am (UTC)
I'm still shaking my head... but, whatever works, I guess!
From: crw 2005-11-23 01:15 am (UTC)
a routing protocol for disks. nice.
I agree. To expand on that point: the previous disk selection method was similar to a distance-vector routing protocol (with all distances set to an equal value), whereas the weighted random method is more like a link-state routing protocol. This is a very cool feature, and perhaps the next step is to include network latency/bandwidth (link state?) in the weight-generating algorithm for occasions when not all networking hardware is performing at the same speed. (Links: link-state on Wikipedia, distance-vector on Wikipedia.)
too bad Mr. MogileFS can't see that little blinky blinky lite on hard disks showing how busy they are...then he'd know what to do!
they still have them lights don't they?
From: edm 2005-11-23 03:08 am (UTC)
Umm, that's pretty much what brad has done. "iostat -x" reports disk accesses (over some recent period of time), and he made that available over the network to MogileFS. Now it can see the "virtual blinking lights". It's a very cool hack.
Ewen
-1, taking a dakus comment in brad's journal seriously. ;)
now I feel all stigmatized!!
"I feel all stigmatized"
But edm was the one who had trouble seeing through your comment...
From: dossy 2005-11-23 03:27 am (UTC)
automatic cascading redistribution?
Brad, my understanding of MogileFS is that it's essentially a "network RAID 1" type data distribution mechanism, right?
If some spindles are too busy while others are nearly idle, couldn't you track what assets are being accessed and replicate that data onto the spindles that are less busy? Or, is that what your weighting is doing? Weighting sounds like a way to guide new read/write operations, but doesn't help with old data, right? MogileFS is WORM, right? So, wouldn't "intelligently" redistributing previously written data have a measurable impact on performance, too?
From: ckd 2005-11-23 03:27 am (UTC)
Profile, then optimize. I love it. Elegant!
From: d4b 2005-11-24 04:26 am (UTC)
"And after I did my 90-100 mph trip down to Fry's the other day to get 10 new 300 GB drives to add more spindles, that just made it even more uneven."
That's a funny image, you bursting into the store, out of breath, wad of cash in your fist, desperate for a three terabyte fix. (Double entendre intended.)