     749   2934  23640 mogilefsd
    1073   4108  35709 lib/MogileFS/Worker/Query.pm
     357   1561  13546 lib/MogileFS/Worker/Replicate.pm
     114    372   3436 lib/MogileFS/Worker/Monitor.pm
      99    316   2691 lib/MogileFS/Worker/Reaper.pm
     137    519   4810 lib/MogileFS/Worker/Delete.pm
      47    100    778 lib/MogileFS/Worker.pm
      52    158   1172 lib/MogileFS/Connection/Client.pm
      84    242   2039 lib/MogileFS/Connection/Worker.pm
     665   3016  23216 lib/MogileFS/ProcManager.pm
     144    504   5293 lib/MogileFS/Config.pm
      13     34    242 lib/MogileFS/Sys.pm
     112    380   2719 lib/MogileFS/Util.pm
    3646  14244 119291 total

Damn that feels good.
What also feels good is having Xen this time around. When we originally wrote MogileFS all our testing was on physical boxes. Now I can run a whole Mogile farm on a single box.
More important, though, is improved error testing.
There are two major types of errors in a distributed application:
1) Immediate error (you get a RST on connect, or some other protocol-level error). Basically you find out about the error almost immediately after you do your action, because the remote machine is alive and responding, just not how you want.
2) Talking to a black hole. The machine is off the net and you just have to time out and assume it's dead after enough time goes by.
Testing and designing for #1 is easy.
The blackhole case is where it gets tricky. Before Xen, the only easy way to test #2 was pulling ethernet cables. Yes, you could also do firewall rules and such, but people do it a lot less often. If you look at people's code (including mine for a long time), all too often they assume the failure will happen immediately.
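To make the distinction concrete, here is a minimal sketch in Python (not MogileFS's actual Perl code) of a connect probe that tells the two classes apart. The names `probe` and `classify_error` and the 2-second default are made up for illustration. The point is just that the black hole case only shows up if the caller sets a deadline in the first place.

```python
import socket
import time


def classify_error(exc):
    """Map a connect-time exception onto the two failure classes.

    Hypothetical helper for illustration only.
    """
    if isinstance(exc, socket.timeout):
        # Nothing answered before our deadline: the black hole case.
        return "blackhole"
    if isinstance(exc, ConnectionRefusedError):
        # Host is alive and sent a RST: the immediate-error case.
        return "immediate"
    return "other"


def probe(host, port, timeout=2.0):
    """Try a TCP connect and return (status, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # The explicit timeout is what keeps a dead host from hanging us.
        with socket.create_connection((host, port), timeout=timeout):
            status = "ok"
    except OSError as exc:
        status = classify_error(exc)
    return status, time.monotonic() - start
```

Against a paused Xen guest, a probe like this would presumably come back as "blackhole" after roughly the full timeout, while a live machine with the daemon stopped would come back as "immediate" almost right away.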
Long story short: I'm pretty sure MogileFS used to be blackhole-error safe (I definitely remember testing it), but no automated test suite covers such errors, and that's the real problem. As a result, we've regressed in certain protocol modes, which now hang way too long before making progress.
Testing blackholes with Xen is so much easier than firewall rules:
    sammy:/home/bradfitz# xm unpause mog2
    sammy:/home/bradfitz# xm pause mog2
    sammy:/home/bradfitz# xm unpause mog2
    sammy:/home/bradfitz# xm pause mog2
    sammy:/home/bradfitz# xm unpause mog2

And then watch logs as MogileFS detects the different classes of errors:
    [monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2: took 2.00 seconds out of 2 allowed
    [monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2: took 1.99 seconds out of 2 allowed
    [monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
    [monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
    [monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
    [monitor(4187)] Port 7500 not listening on otherwise-alive machine 10.0.0.26?
    [monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2: took 2.00 seconds out of 2 allowed
    [monitor(4187)] Timeout contacting machine 10.0.0.26 for dev 2: took 2.00 seconds out of 2 allowed

Even better, a test suite for the entire setup with all the moving parts seems very feasible now.
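
One way such a suite might check its results is to bucket the monitor's log lines by error class. The sketch below is Python rather than anything in MogileFS itself, and the regexes are written against only the two message shapes shown above. Treating "not listening on otherwise-alive machine" as the immediate-error class is my reading of those messages, and the function names are hypothetical.

```python
import re
from collections import Counter

# Patterns assumed from the two log message shapes quoted in this post.
TIMEOUT_RE = re.compile(r"Timeout contacting machine (\S+) for dev (\d+)")
NOT_LISTENING_RE = re.compile(
    r"Port (\d+) not listening on otherwise-alive machine ([\d.]+)"
)


def classify_line(line):
    """Return 'blackhole', 'immediate', or None for an unrelated line."""
    if TIMEOUT_RE.search(line):
        return "blackhole"
    if NOT_LISTENING_RE.search(line):
        return "immediate"
    return None


def summarize(lines):
    """Count how many log lines fall into each error class."""
    classes = (classify_line(line) for line in lines)
    return Counter(c for c in classes if c is not None)
```

A test could pause the guest, wait a few monitor cycles, and then assert that `summarize` saw at least one "blackhole" line and nothing hung in the meantime.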