An increasing number of companies (large and small) are really insistent that we ping them with all blog updates, for reasons I won't rant about.
Just to prove a point, I flooded a couple of them and found that, sure enough, nobody can really keep up. It's even more annoying when they don't support persistent HTTP connections.
So --- I decided to turn things on their head and make them get data from us. If they can't keep up, it's their loss.
Prototype: (not its final home)
$ telnet danga.com 8081
GET /atom-stream.xml HTTP/1.0<enter>
<enter>
And enjoy the never-ending XML stream of Atom feeds, each containing one entry. If you get more than 256k behind (not including your TCP window size), we start dropping entries and you see:
<sorryTooSlow youMissed="23" />
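For anyone curious how the dropping could work, here is a minimal sketch in Python. It is not the actual server code: the class name SlowClientBuffer, its methods, and the choice to count dropped entries until the client has room again are all assumptions made for illustration. Only the 256k limit and the sorryTooSlow element come from the description above.

```python
class SlowClientBuffer:
    """Hypothetical per-client output buffer that drops entries at a byte limit."""

    def __init__(self, limit_bytes=256 * 1024):
        self.limit_bytes = limit_bytes  # how far behind a client may fall
        self.pending = []               # byte strings not yet written to the socket
        self.pending_bytes = 0
        self.missed = 0                 # entries dropped since the last notice

    def _append(self, data):
        self.pending.append(data)
        self.pending_bytes += len(data)

    def push(self, entry):
        """Queue one serialized entry, or count it as missed if the client is behind."""
        if self.pending_bytes + len(entry) > self.limit_bytes:
            self.missed += 1
            return
        if self.missed:
            # Assumption: the notice is sent once the client has room again,
            # and is allowed to push the buffer slightly past the limit.
            self._append(b'<sorryTooSlow youMissed="%d" />\n' % self.missed)
            self.missed = 0
        self._append(entry)

    def drain(self):
        """Return everything queued (as if written to the socket) and reset."""
        data = b"".join(self.pending)
        self.pending.clear()
        self.pending_bytes = 0
        return data
```

A real server would drain only as many bytes as the socket accepts on each write; draining everything at once keeps the sketch short.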
I think soon we'll get TypePad and perhaps MovableType blogs all being sent through this. The final home will probably be on a subdomain of sixapart.com somewhere, including documentation better than this blog entry.
And yes, I'm sure my Atom syntax is bogus or something. I spent a good 2 minutes on that part of it.
Hmm. I don't understand how, but somehow when I try this on my Windows box here it starts printing out gibberish and items from my command history going back weeks. Weird stuff. On my Linux box it messes up my terminal. I guess there are some control characters in there causing weirdness. Good thing that telnet isn't the recommended way to access this. ;)
Hopefully this stuff will start getting PubSubbed at some point, assuming that someone can keep up with it well enough to produce decent PubSub output.
An increasing number of companies (large and small) are really insistent that we ping them with all blog updates, for reasons I won't rant about.
Uhh, do they have rocks in their heads? At ~6 posts a second, that's like asking to take a shower via a firehose.
I'd just get something in writing just to confirm they really want to do this. Then when you open the proverbial floodgates upon them they can't nail you or 6A for blowing the servers off of the face of the planet.
In their defense it's probably some marketing/management drone pressing for you to do this, while the admins are already getting their white flags and upstream blocks in place.
It's not such a BFD --- in the 4 minutes 6 seconds I listened, I got about 1.3 megabytes, comprising 733 posts. That's only about 5 kB/sec, which you could handle on a 56kbps modem, and a little under 3 posts per second. (Presumably this is the public half of LJ.) That's, like, half a gigabyte a day, or a few dimes' worth of bandwidth. The junky but relatively optimized full-text indexer I use on my email can reindex a gigabyte in about ten minutes on my old 600MHz laptop. (See kragen-hacks archives for details.)
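The back-of-envelope numbers above can be checked in a few lines of Python. This assumes "1.3 megabytes" means 1.3 million bytes and uses the nominal 56 kbps line rate, not real-world modem throughput; the variable names are only for illustration.

```python
elapsed_s = 4 * 60 + 6       # 4 minutes 6 seconds = 246 seconds
received_bytes = 1.3e6       # assumption: decimal megabytes
posts = 733

bytes_per_s = received_bytes / elapsed_s   # about 5.3 kB/s
posts_per_s = posts / elapsed_s            # just under 3 posts per second
bytes_per_day = bytes_per_s * 86400        # about 0.46 GB per day

modem_bytes_per_s = 56_000 / 8             # nominal 56 kbps = 7000 bytes/s

print(f"{bytes_per_s / 1e3:.1f} kB/s, {posts_per_s:.2f} posts/s, "
      f"{bytes_per_day / 1e9:.2f} GB/day")
```

So the observed rate fits under the nominal modem rate, and "half a gigabyte a day" is a fair round-up.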
I am not really qualified to speculate on why LJ needs a big server farm to handle 6 posts per second (although bradfitz's talk at OSCON was really great), but I'm guessing that it includes items from the following list: pageviews, comments, authorization, usericons, reliability, friends pages. All of these are required to actually run LJ, but you don't need any of them to slurp from this particular firehose.
FWIW I think this is really excellent. Thanks Brad! Yay perlbal!
With a little bit of extra markup (plus ideally a little bit of support for handshaking on connection) this could become an XMPP stream and the existing XMPP client libraries would be able to suck it up, saving people from having to write new parsing code (which is a pain because many XML libraries won't play nice when there's no proper end to a document). I guess once it stops being HTTP Perlbal becomes less helpful, though. Last I checked the XMPP libraries for Perl were a little clunky as well.
The two obvious ways of parsing it would be to either use a SAX API parser (i.e., process the start/stop tags as they come in), or do some high-level parsing to spot start/end of Atom entries and then wrap those in enough surrounding junk to make a DOM parser happy. Neither seems especially difficult to do.
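As a rough illustration of the first approach, here is a sketch using Python's standard xml.etree.ElementTree.XMLPullParser, which accepts data in arbitrary chunks. The exact layout of the stream is not documented here, so this makes some loud assumptions: the HTTP response headers have already been stripped, the body is a sequence of top-level feed and sorryTooSlow elements with no XML declaration in between, and a made-up wrapper root (the "stream" element, which is not part of the real feed) is fed first so the parser never sees more than one document element. StreamReader and local_name are hypothetical names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, since the exact Atom namespace is unknown here."""
    return tag.rsplit("}", 1)[-1]


class StreamReader:
    """Incrementally parse an endless run of top-level XML elements."""

    def __init__(self):
        self.parser = ET.XMLPullParser(events=("start", "end"))
        self.parser.feed(b"<stream>")  # hypothetical wrapper root, never closed
        self.depth = 0
        self.root = None

    def feed(self, chunk):
        """Feed raw bytes; return a list of ('feed', element) or ('missed', count)."""
        results = []
        self.parser.feed(chunk)
        for event, elem in self.parser.read_events():
            if event == "start":
                self.depth += 1
                if self.depth == 1:
                    self.root = elem  # the wrapper element itself
                continue
            self.depth -= 1
            if self.depth != 1:
                continue  # only act when a direct child of the wrapper closes
            name = local_name(elem.tag)
            if name == "sorryTooSlow":
                results.append(("missed", int(elem.get("youMissed", "0"))))
            else:
                results.append((name, elem))
            # Detach finished children so memory does not grow without bound.
            self.root.clear()
        return results
```

In use, each chunk read from the socket goes to feed(), and whatever complete elements have arrived so far come back. The second approach (spotting entry boundaries and wrapping each one for a DOM parser) would work too; this one just avoids hand-written boundary detection.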
I, too, love the <sorryTooSlow/> tag. And marvel at the people who ever thought their architecture could take 6+ incoming posts a second and do useful things with them, without an LJ-sized infrastructure.
Bloody brilliant. You're like that kid at school. The one in the school yard. With the firehose. Going full tilt. A kid behind you says, "Brad?," and you turn casually then watch as he skids on his ass across the blacktop into a cinder block wall. "Brad, could you?" You turn again.