LazyWeb: CS types: what is this thing? - brad's life
Brad Fitzpatrick

LazyWeb: CS types: what is this thing? [Aug. 16th, 2006|11:58 pm]
Brad Fitzpatrick

Computer Science types,

Help me identify something.

I don't know the proper name for this thing we're using. It's close to a Linda Tuplespace, but I'm reluctant to call it that if there's some major characteristic we're lacking. It might also be called a Message Queue or Message Bus, but that feels like a big stretch.

Here's what it is.....

Operations, renamed for clarity:

INSERT(JobType, JobOpaqueArgs) returning JobHandle
ATOMIC_GRAB(JobType, lockInterval) returning a JobHandle
-- Atomically fetches a JobHandle matching type JobType; it won't be given out again for lockInterval seconds.
ATOMIC_GRAB_WITH_SEARCH(JobType, [Other_search_parameters], lockInterval) returning a JobHandle
-- Same as ATOMIC_GRAB, but the job must also match the other search parameters; it won't be given out again for lockInterval seconds.
COMPLETE(JobHandle)
-- marks a JobHandle as done
TEMP_FAIL(JobHandle, message, retryInNSeconds)
PERM_FAIL(JobHandle, message)
DETAILS(JobHandle) returns details of job: its JobOpaqueArgs, history of failures, etc...
REPLACE(JobHandle, [JobType, JobOpaqueArgs]+) -- atomically replace one job (mark it as done) with n other jobs.
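
To make the operations above concrete, here is a minimal single-process sketch in Python. It is not the actual implementation: the real system is database-backed and gets its atomicity from the database, whereas this toy keeps jobs in a dict. The names `Job`, `JobQueue`, the explicit `now` argument, the `match` predicate standing in for the search parameters, and the oldest-first grab order are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Job:
    handle: int
    job_type: str
    args: object
    available_at: float = 0.0  # earliest time the job may be grabbed
    done: bool = False
    failures: list = field(default_factory=list)


class JobQueue:
    """Toy in-memory stand-in for the database-backed job store."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def insert(self, job_type, args, now=0.0):
        handle = next(self._ids)
        self._jobs[handle] = Job(handle, job_type, args, available_at=now)
        return handle

    def atomic_grab(self, job_type, lock_interval, now, match=None):
        # Oldest-first is an assumption; the post does not specify an order.
        for job in self._jobs.values():
            if job.done or job.job_type != job_type or job.available_at > now:
                continue
            if match is not None and not match(job.args):
                continue
            # Hide the job until the lock expires; if the worker never
            # reports back, the job becomes grabbable again.
            job.available_at = now + lock_interval
            return job.handle
        return None

    def complete(self, handle):
        self._jobs[handle].done = True

    def temp_fail(self, handle, message, retry_in, now):
        job = self._jobs[handle]
        job.failures.append(message)
        job.available_at = now + retry_in

    def perm_fail(self, handle, message):
        job = self._jobs[handle]
        job.failures.append(message)
        job.done = True

    def details(self, handle):
        return self._jobs[handle]

    def replace(self, handle, new_jobs, now):
        # Mark one job done and insert n successors of any type.
        self.complete(handle)
        return [self.insert(t, a, now) for t, a in new_jobs]
```

With this sketch, a job grabbed at time 0 with a 30-second lock is invisible to a second grab at time 10 and visible again at time 31, which is the "given back out to somebody else" behaviour.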

The idea, as you've discovered, is that it's a job queueing system, with optional retries. We can throw crap into it and not care who's supposed to do it. The database scales horizontally (there are multiple) and each pair is HA. Workers are spread all over too, and only grab what they're able to do. If workers don't report back in lockInterval seconds, the job can be given back out to somebody else. (We're going to track the hostname and pid that grabbed the job too, so secondary grabbers can STONITH if they need to, if that satisfies their job class's guarantees.)

We're using this for both ESN and sending email. In ESN, every action on the site generates an event which goes into this. That part's done in web context. After that, it's all asynchronous to the web request. Processing the FiredEvent job type looks up listeners, replacing the FiredEvent job with 0 or more "ProcessSubscription" jobs. Those are then processed (designed to scale up to millions of those from a single event) into 0 or more (likely 1-3) "ProcessNotification" jobs for Email/SMS/IM/WebservicesPing/etc. All from a single event. Obviously we couldn't have done that all during an HTTP request, but it had to get done, and by lots of machines/processes in parallel.
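
The fan-out described above, where each job is replaced by zero or more successor jobs, can be sketched roughly as follows. This is only an illustration: the `SUBSCRIBERS` and `CHANNELS` tables, the handler names, and the single in-process `deque` are hypothetical stand-ins for the real listener lookup and the distributed workers.

```python
from collections import deque

# Hypothetical subscription table: event name -> subscribed users.
SUBSCRIBERS = {"new_comment": ["alice", "bob"]}
# Hypothetical per-user delivery channels.
CHANNELS = {"alice": ["email", "sms"], "bob": ["email"]}


def handle_fired_event(args, sent):
    """FiredEvent -> 0 or more ProcessSubscription jobs."""
    return [("ProcessSubscription", {"user": u, "event": args["event"]})
            for u in SUBSCRIBERS.get(args["event"], [])]


def handle_subscription(args, sent):
    """ProcessSubscription -> 0 or more ProcessNotification jobs."""
    return [("ProcessNotification", {"user": args["user"], "via": c})
            for c in CHANNELS.get(args["user"], [])]


def handle_notification(args, sent):
    """ProcessNotification is a leaf: deliver and produce no new jobs."""
    sent.append((args["user"], args["via"]))
    return []


HANDLERS = {
    "FiredEvent": handle_fired_event,
    "ProcessSubscription": handle_subscription,
    "ProcessNotification": handle_notification,
}


def run(initial_jobs):
    queue = deque(initial_jobs)
    sent = []
    while queue:
        job_type, args = queue.popleft()
        # REPLACE: the current job is finished and its successors are queued.
        queue.extend(HANDLERS[job_type](args, sent))
    return sent
```

Calling `run([("FiredEvent", {"event": "new_comment"})])` walks one event through both replacement stages. In the real system each stage would be picked up by whichever worker grabs that job type, rather than by one loop.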

Likewise with email. We put an email in, then somebody picks it up, does the DNS lookup, and delivers to the final SMTP server (no intermediate Postfix/sendmail/etc on our side). On a 5xx error, we're done. Otherwise we do the retry-less-and-less for 4-5 days thing, as per the SMTP spec. This is about a page of code, in contrast to ESN. Fortunately the job management system does all the dirty work.
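
As a rough sketch of how the email worker's outcome might map onto COMPLETE, PERM_FAIL and TEMP_FAIL: the function below is hypothetical, and the concrete numbers (first retry after 30 minutes, doubling up to an 8-hour cap, giving up after 5 days) are assumptions chosen to illustrate "retry less and less", not values from the actual code.

```python
GIVE_UP_AFTER = 5 * 24 * 3600  # assumed: stop retrying after 5 days
FIRST_RETRY = 30 * 60          # assumed: first retry after 30 minutes
MAX_RETRY = 8 * 3600           # assumed: cap on the gap between retries


def next_action(smtp_code, attempts, age_seconds):
    """Decide which queue operation to call after a delivery attempt.

    smtp_code:   final SMTP reply code, or None if the connection failed.
    attempts:    number of attempts made so far, including this one.
    age_seconds: time since the job was first inserted.
    Returns (operation, retry_in_seconds).
    """
    if smtp_code is not None and 200 <= smtp_code < 300:
        return ("COMPLETE", None)
    if smtp_code is not None and 500 <= smtp_code < 600:
        return ("PERM_FAIL", None)  # 5xx: we're done
    # Anything else is temporary: back off exponentially, up to the cap.
    delay = min(FIRST_RETRY * 2 ** (attempts - 1), MAX_RETRY)
    if age_seconds + delay > GIVE_UP_AFTER:
        return ("PERM_FAIL", None)  # retry window exhausted
    return ("TEMP_FAIL", delay)
```

The worker stays small because it only classifies the outcome; the scheduling of the retry itself is left to the queue via TEMP_FAIL's retryInNSeconds.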

So what is this thing called?

I'd like to promote it to others, if only I could succinctly describe it.

Comments:
From: taral
2006-08-17 07:36 am (UTC)
You've got primarily a searchable queue with an adjoined timer queue. The JobType stuff is just a way of mixing together many of these, but since there is no operation that can span JobTypes, they can be considered separate.
(Reply) (Thread)
From: ciphergoth
2006-08-17 07:41 am (UTC)
REPLACE spans job types, doesn't it?
(Reply) (Parent) (Thread)
From: brad
2006-08-17 07:46 am (UTC)
It does. We frequently replace jobs with those of other/mixed types.

So really the job type isn't any more special than the other parameters that we can search on while grabbing.
(Reply) (Parent) (Thread)
From: brad
2006-08-17 07:47 am (UTC)
(it's only special in that we have database indexes to optimize for all the common searches)
(Reply) (Parent) (Thread)
From: taral
2006-08-17 11:29 pm (UTC)
Mmm, good point. Okay, so you'd have to instantiate the multiples under the atomicity umbrella.
(Reply) (Parent) (Thread)
From: robbat2
2006-08-17 08:11 am (UTC)
Batch processing system with characteristics of guaranteed execution (covers your atomic grab etc) and intelligent allocation (covers the search parameters, but they would generally be on the master scheduling box and not be searched from the clients, for various reasons).

Existing examples include "Generic NQS" (with its parentage starting at NASA in the 1980s IIRC), PBS, OpenPBS, Torque (a derivative of OpenPBS).
Torque is the current state of the art / actively maintained application, with pluggable scheduling mechanisms.

It's been a while since I looked at them, but the only potentially interesting stuff you have here is reliably tracking whether a job has completed. There are lots of failure cases that might send an email but still return failure, with nasty results like spamming inboxes. The ultimate solution is breaking a job down into sub-jobs and having your batching system handle job dependencies.

I will say however that if you can bring your previous skills at approachable and easily deployable code (memcached/mogilefs/perlbal being examples), it will make batch systems MUCH more approachable than the existing mess that is Torque et al.
(Reply) (Thread)
From: jamesd
2006-08-17 09:26 am (UTC)
Just use "job manager" or "job queue". The former if it handles the worker parts of the task as well as the queue itself.
(Reply) (Thread)
From: dossy
2006-08-17 02:19 pm (UTC)
Smells like Pub-Sub to me.
(Reply) (Thread)
From: brad
2006-08-17 04:24 pm (UTC)
Nope. I've read pub-sub specs and pub-sub smells entirely different.
(Reply) (Parent) (Thread)
From: staticground
2006-08-17 04:56 pm (UTC)

If I were a marketing geek

How about:

* Hierarchical Job Director/Orchestrator
* Recursive Job Manager
* Queued Task Digestor

(Reply) (Thread)
From: divelog
2006-08-17 05:31 pm (UTC)

Queue?

So the ATOMIC_GRAB_* functions grab the oldest jobs in the queue? That wasn't explicitly stated in your description.

Sounds like a basic scheduling fifo to me.
(Reply) (Thread)
From: (Anonymous)
2006-08-30 08:00 am (UTC)

some links

http://www.google.com/search?q=workflow+engine
http://en.wikipedia.org/wiki/Workflow
http://en.wikipedia.org/wiki/Blackboard_system
(Reply) (Thread)