Brad Fitzpatrick (brad) wrote,

mogilefs maintenance

I've gutted the 3,500-line mogilefsd server and broken it into manageable (and much cleaner!) pieces:
   749   2934  23640 mogilefsd
  1073   4108  35709 lib/MogileFS/Worker/Query.pm
   357   1561  13546 lib/MogileFS/Worker/Replicate.pm
   114    372   3436 lib/MogileFS/Worker/Monitor.pm
    99    316   2691 lib/MogileFS/Worker/Reaper.pm
   137    519   4810 lib/MogileFS/Worker/Delete.pm
    47    100    778 lib/MogileFS/Worker.pm
    52    158   1172 lib/MogileFS/Connection/Client.pm
    84    242   2039 lib/MogileFS/Connection/Worker.pm
   665   3016  23216 lib/MogileFS/ProcManager.pm
   144    504   5293 lib/MogileFS/Config.pm
    13     34    242 lib/MogileFS/Sys.pm
   112    380   2719 lib/MogileFS/Util.pm
  3646  14244 119291 total
Damn that feels good.

What also feels good is having Xen this time around. When we originally wrote MogileFS, all our testing was on physical boxes. Now I can run a whole Mogile farm on a single box.

More important, though, is improved error testing.

There are two major types of errors in a distributed application:

1) Immediate error (you get a RST on connect, or some other protocol-level error). Basically, you find out about the error almost immediately after you do your action, because the remote machine is alive and responding, just not how you want.

2) Talking to a black hole. The machine is off the net, and you just have to time out and assume it's dead after enough time goes by.

Testing and designing for #1 is easy.

The blackhole case is where it gets tricky. Before Xen, the only easy way to test #2 was pulling ethernet cables. Yes, you could also do firewall rules and such, but people do it a lot less often. If you look at people's code (including mine for a long time), all too often they assume the failure will happen immediately.
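
The two failure classes look different at the socket level, which is part of why the second one is so easy to overlook. As a rough illustration (a Python sketch, not MogileFS's actual Perl code; the `probe` and `classify` names and the return labels are made up for this example, and the 2-second default just echoes the monitor log output in this post), a connect attempt can be sorted into the two cases like this:

```python
import socket


def classify(exc):
    """Map a connect-time exception to one of the two failure classes."""
    # Check the timeout first: nothing answered at all (the black hole).
    if isinstance(exc, socket.timeout):
        return "blackhole"
    # The remote machine is alive and answered with a RST.
    if isinstance(exc, ConnectionRefusedError):
        return "immediate"
    return "other"


def probe(host, port, timeout=2.0):
    """Try a TCP connect and report 'alive', 'immediate', 'blackhole' or 'other'."""
    try:
        # Without an explicit timeout, a black-holed host can block the caller
        # for as long as the operating system keeps retrying the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "alive"
    except OSError as exc:
        return classify(exc)
```

A refused connection returns right away because the remote kernel replies; the blackhole case only returns once `timeout` has elapsed, so every call that can hit it needs an explicit time budget.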

Long story short: I'm pretty sure MogileFS used to be blackhole-error safe (I definitely remember testing it), but there's no test suite that automates testing of such errors, and that's the real problem. As a result, we've regressed in certain protocol modes, which now hang way too long before making progress.

Testing blackholes with Xen is so much easier than firewall rules:
sammy:/home/bradfitz# xm unpause mog2
sammy:/home/bradfitz# xm pause mog2
sammy:/home/bradfitz# xm unpause mog2
sammy:/home/bradfitz# xm pause mog2
sammy:/home/bradfitz# xm unpause mog2
And then watch logs as MogileFS detects the different classes of errors:
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 1.99 seconds out of 2 allowed
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
[monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2:  took 2.00 seconds out of 2 allowed
Even better, a test suite for the entire setup with all the moving parts seems very feasible now.
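
One way such a suite could drive the pause/unpause cycle is sketched below in Python. This is only an assumption about how a harness might look, not part of MogileFS: the `paused` helper and its `run` parameter are hypothetical, and the only real commands it relies on are the `xm pause` and `xm unpause` shown above.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def paused(domain, run=subprocess.check_call):
    """Pause a Xen guest for the duration of the block, then unpause it.

    `run` is injectable so the helper can be exercised without Xen installed.
    """
    run(["xm", "pause", domain])
    try:
        yield
    finally:
        # Always bring the guest back, even if the check inside the block fails.
        run(["xm", "unpause", domain])
```

Inside a `with paused("mog2"):` block the guest is a black hole, so a test can assert that a request to it gives up within its allowed timeout; the `finally` clause makes sure the guest is unpaused even when that assertion fails.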
Tags: mogilefs, tech, xen