mogilefs maintenance - brad's life
Brad Fitzpatrick


mogilefs maintenance [Jun. 4th, 2006|07:42 pm]
Brad Fitzpatrick

I've gutted the 3,500-line mogilefsd server and broken it into manageable (and much cleaner!) pieces:
   749   2934  23640 mogilefsd
  1073   4108  35709 lib/MogileFS/Worker/Query.pm
   357   1561  13546 lib/MogileFS/Worker/Replicate.pm
   114    372   3436 lib/MogileFS/Worker/Monitor.pm
    99    316   2691 lib/MogileFS/Worker/Reaper.pm
   137    519   4810 lib/MogileFS/Worker/Delete.pm
    47    100    778 lib/MogileFS/Worker.pm
    52    158   1172 lib/MogileFS/Connection/Client.pm
    84    242   2039 lib/MogileFS/Connection/Worker.pm
   665   3016  23216 lib/MogileFS/ProcManager.pm
   144    504   5293 lib/MogileFS/Config.pm
    13     34    242 lib/MogileFS/Sys.pm
   112    380   2719 lib/MogileFS/Util.pm
  3646  14244 119291 total
Damn that feels good.

What also feels good is having Xen this time around. When we originally wrote MogileFS all our testing was on physical boxes. Now I can run a whole Mogile farm on a single box.

More important, though, is improved error testing.

There are two major types of errors in a distributed application:

1) immediate error (you get a RST on connect, or some other protocol-level error). Basically you find out about the error almost immediately after you do your action, because the remote machine is alive and responding, just not how you want.

2) talking to a black hole. The machine is off the net and you just have to time out and assume it's dead after enough time goes by.
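
To make the two classes concrete, here is a minimal sketch of a connect probe that tells them apart. It's Python rather than MogileFS's own code, and the `probe` name, its return values, and the 2-second default are all assumptions for illustration. A refused connection comes back almost instantly; a black hole only shows up once the timeout expires.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Classify one TCP connect attempt as 'ok', 'refused', or 'timeout'.

    Hypothetical helper for illustration; not part of MogileFS.
    Returns (kind, seconds_elapsed).
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # Case 1: the machine is alive and answered with a RST right away.
        return "refused", time.monotonic() - start
    except socket.timeout:
        # Case 2: nothing came back at all; we only know because time ran out.
        return "timeout", time.monotonic() - start
    sock.close()
    return "ok", time.monotonic() - start
```

Other failures (an unreachable network, a DNS error) are deliberately left to propagate as exceptions so the sketch stays focused on the two cases above.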

Testing and designing for #1 is easy.

The blackhole case is where it gets tricky. Before Xen, the only easy way to test #2 was pulling ethernet cables. Yes, you could also do firewall rules and such, but people do it a lot less often. If you look at people's code (including mine for a long time), all too often they assume the failure will happen immediately.

Long story short: I'm pretty sure MogileFS used to be blackhole-error safe (I definitely remember testing it), but there's no test suite that automates testing of such errors, and that's the real problem. As a result, we've regressed in certain protocol modes, which now hang way too long before making progress.
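
An automated check for this kind of regression might look something like the sketch below. It's Python, not anything from the MogileFS tree, and every name in it (`request_with_deadline`, `run_blackhole_check`, the `ping` payload) is made up for illustration. The stand-in black hole is a local listening socket that never calls accept: the kernel completes the connection, but no reply ever comes. That only covers a peer that goes silent after connect; a machine that is completely off the net won't answer the connect either, and that case still needs something like a paused guest.

```python
import socket
import time


def request_with_deadline(host, port, payload, timeout):
    """Send payload and wait for a reply; return None if none arrives in time."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        try:
            return sock.recv(4096)
        except socket.timeout:
            # The peer never answered: treat it as dead and move on.
            return None


def run_blackhole_check(timeout=0.5):
    """Talk to a silent listener and report (reply, seconds_elapsed)."""
    # Listening socket that never accepts or replies: our stand-in black hole.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    try:
        start = time.monotonic()
        reply = request_with_deadline("127.0.0.1", port, b"ping\r\n", timeout)
        return reply, time.monotonic() - start
    finally:
        srv.close()
```

A test would then assert that `reply` is `None` and that the elapsed time stays close to the timeout, so a code path that hangs way too long fails the suite instead of going unnoticed.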

Testing blackholes with Xen is so much easier than firewall rules:
sammy:/home/bradfitz# xm unpause mog2
sammy:/home/bradfitz# xm pause mog2
sammy:/home/bradfitz# xm unpause mog2
sammy:/home/bradfitz# xm pause mog2
sammy:/home/bradfitz# xm unpause mog2
And then watch logs as MogileFS detects the different classes of errors:
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 1.99 seconds out of 2 allowed
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
Even better, a test suite for the entire setup with all the moving parts seems very feasible now.

Comments:
From: dormando
2006-06-05 04:04 am (UTC)
Ahhh, good to see mogilefs getting some love finally! Sometime this week I'm adding a mogilefsd setup to our developer Xen network. I'll keep an eye on the mailing list and test new code with our application under dev if you'd like.
From: tijuanacartel
2006-06-05 08:19 am (UTC)
Yeah, Xen is great for testing anything to do with clustering that requires more than one node to be useful or work. I'm going to be doing some testing with Lustre quite soon, which should be fun :)