Brad Fitzpatrick (brad) wrote,

LazyWeb: CS types: what is this thing?

Computer Science types,

Help me identify something.

I don't know the proper name for this thing we're using. It's close to a Linda Tuplespace, but I'm reluctant to call it that if there's some major characteristic we're lacking. It might also be called a Message Queue or Message Bus, but that feels like a big stretch.

Here's what it is.....

Operations, renamed for clarity:

INSERT(JobType, JobOpaqueArgs) returning JobHandle
ATOMIC_GRAB(JobType, lockInterval) returning a JobHandle
-- Atomically fetches a JobHandle of type JobType; that job won't be given out again for lockInterval seconds.
ATOMIC_GRAB_WITH_SEARCH(JobType, [Other_search_parameters], lockInterval) returning a JobHandle
-- Same atomic fetch as ATOMIC_GRAB, but the job must also match the other search parameters; it won't be given out again for lockInterval seconds.
-- marks a JobHandle as done
TEMP_FAIL(JobHandle, message, retryInNSeconds)
PERM_FAIL(JobHandle, message)
DETAILS(JobHandle) returns details of job: its JobOpaqueArgs, history of failures, etc...
REPLACE(JobHandle, [JobType, JobOpaqueArgs]+) -- atomically replace one job (mark it as done) with n other jobs.
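
To pin down the semantics, here's a rough single-process sketch of those operations in Python. It is not our implementation: the real thing sits on databases, and this version only gets its "atomicity" from being single-threaded. The class and field names, the injected clock, and the method name `done` for the mark-as-done operation are all assumptions made for illustration. ATOMIC_GRAB_WITH_SEARCH is left out.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Job:
    handle: int
    job_type: str
    args: object                  # opaque to the queue
    available_at: float = 0.0     # job is not handed out before this time
    state: str = "pending"        # "pending", "done" or "failed"
    failures: list = field(default_factory=list)


class JobQueue:
    """In-memory stand-in for the database-backed job store (illustrative only)."""

    def __init__(self, clock):
        self.clock = clock        # callable returning the current time in seconds
        self.jobs = {}
        self._ids = itertools.count(1)

    def insert(self, job_type, args):
        handle = next(self._ids)
        self.jobs[handle] = Job(handle, job_type, args)
        return handle

    def atomic_grab(self, job_type, lock_interval):
        """Return a handle of the given type and lock it for lock_interval seconds."""
        now = self.clock()
        for job in self.jobs.values():
            if (job.state == "pending" and job.job_type == job_type
                    and job.available_at <= now):
                job.available_at = now + lock_interval
                return job.handle
        return None               # nothing available right now

    def done(self, handle):       # hypothetical name for "marks a JobHandle as done"
        self.jobs[handle].state = "done"

    def temp_fail(self, handle, message, retry_in_n_seconds):
        job = self.jobs[handle]
        job.failures.append(message)
        job.available_at = self.clock() + retry_in_n_seconds

    def perm_fail(self, handle, message):
        job = self.jobs[handle]
        job.failures.append(message)
        job.state = "failed"

    def details(self, handle):
        job = self.jobs[handle]
        return {"type": job.job_type, "args": job.args,
                "state": job.state, "failures": list(job.failures)}

    def replace(self, handle, new_jobs):
        """Mark one job done and insert n others; new_jobs is a list of (type, args)."""
        self.done(handle)
        return [self.insert(job_type, args) for job_type, args in new_jobs]
```

Passing the clock in makes the lock behaviour easy to poke at: grab a job, grab again and get nothing, advance the clock past lockInterval, and the same handle comes back out.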

The idea, as you've discovered, is that it's a job queueing system with optional retries. We can throw crap into it and not care who's supposed to do it. The database scales horizontally (there are multiple) and each pair is HA. Workers are spread all over too, and only grab what they're able to do. If a worker doesn't report back within lockInterval seconds, the job can be given back out to somebody else. (We're going to track the hostname and pid that grabbed the job too, so secondary grabbers can STONITH if they need to, if that satisfies their job class's guarantees.)

We're using this for both ESN and sending email. In ESN, every action on the site generates an event which goes into this. That part's done in web context. After that, it's all asynchronous with respect to the web request. Processing the FiredEvent job type looks up listeners, replacing the FiredEvent job with 0 or more "ProcessSubscription" jobs. Those are then processed (designed to scale up to millions of those from a single event) into 0 or more (likely 1-3) "ProcessNotification" jobs for Email/SMS/IM/WebservicesPing/etc. All from a single event. Obviously we couldn't have done all that during an HTTP request, but it had to get done, and by lots of machines/processes in parallel.
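
The fan-out is easier to see as a toy. This sketch runs the whole chain in one process with a plain deque, where the real system would have many workers each grabbing one job and replacing it with its children. The subscription tables, user names, and function names are made up for illustration.

```python
from collections import deque

# Hypothetical subscription data: event name -> listeners, listener -> channels.
LISTENERS = {"new_comment": ["alice", "bob"]}
CHANNELS = {"alice": ["email", "sms"], "bob": ["email"]}


def expand(job):
    """Return the jobs that replace `job` (possibly none)."""
    job_type, args = job
    if job_type == "FiredEvent":
        return [("ProcessSubscription", {"event": args["event"], "user": user})
                for user in LISTENERS.get(args["event"], [])]
    if job_type == "ProcessSubscription":
        return [("ProcessNotification", {"user": args["user"], "channel": channel})
                for channel in CHANNELS.get(args["user"], [])]
    return []  # ProcessNotification is a leaf: it delivers and is then done


def run(first_job):
    """Drain the queue, returning the (user, channel) notifications delivered."""
    queue = deque([first_job])
    delivered = []
    while queue:
        job = queue.popleft()
        if job[0] == "ProcessNotification":
            delivered.append((job[1]["user"], job[1]["channel"]))
        queue.extend(expand(job))  # the REPLACE step: one job out, n jobs in
    return delivered
```

With the tables above, `run(("FiredEvent", {"event": "new_comment"}))` ends up delivering three notifications from the single event, and an event nobody listens to delivers none.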

Likewise with email. We put an email in, then somebody picks it up, does the DNS lookup, and delivers to the final SMTP server (no intermediate Postfix/sendmail/etc on our side). On a 5xx error, we're done. Otherwise we do the retry-less-and-less-often for 4-5 days thing, as per the SMTP spec. This is about a page of code, in contrast to ESN. Fortunately the job management system does all the dirty work.
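
The decision an email worker makes after each delivery attempt might look something like the sketch below. The specific numbers (15 minutes doubling up to an 8 hour cap, giving up at 5 days) are assumptions picked to illustrate the retry-less-and-less shape, not our actual schedule, and the action names are placeholders for the queue operations.

```python
GIVE_UP_AFTER = 5 * 24 * 3600   # assumed cutoff: 5 days, in seconds
FIRST_DELAY = 15 * 60           # assumed first retry delay: 15 minutes
MAX_DELAY = 8 * 3600            # assumed cap on the delay: 8 hours


def retry_delay(attempt):
    """Seconds to wait before retry number `attempt` (1-based), doubling up to a cap."""
    return min(FIRST_DELAY * 2 ** (attempt - 1), MAX_DELAY)


def next_action(smtp_code, attempt, age_seconds):
    """Map an SMTP reply code onto a queue operation for this job.

    attempt     -- how many delivery attempts have been made so far (1-based)
    age_seconds -- how long the message has been in the queue
    """
    if 200 <= smtp_code < 300:
        return ("done",)                 # delivered
    if 500 <= smtp_code < 600:
        return ("perm_fail",)            # permanent error: we're done
    delay = retry_delay(attempt)
    if age_seconds + delay > GIVE_UP_AFTER:
        return ("perm_fail",)            # been retrying too long, give up
    return ("temp_fail", delay)          # retry later, less and less often
```

The worker itself stays tiny because all it has to do is translate the SMTP outcome into a TEMP_FAIL with a delay, a PERM_FAIL, or a mark-as-done, and the queue handles the rest.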

So what is this thing called?

I'd like to promote it to others, if only I could succinctly describe it.
Tags: lazyweb, tech