Never ending feed of Atom feeds
[Aug. 16th, 2005|12:58 pm]
An increasing number of companies (large and small) are really insistent that we ping them with all blog updates, for reasons I won't rant about.
Just to prove a point, I flooded a couple of them and found that sure enough, nobody can really keep up. It's even more annoying when they don't even support persistent HTTP connections.
So --- I decided to turn things on their head and make them get data from us. If they can't keep up, it's their loss.
Prototype: (not its final home)
$ telnet danga.com 8081
GET /atom-stream.xml HTTP/1.0<enter>
And enjoy the never ending XML stream of Atom feeds, each containing one entry. And if you get more than 256k behind (not including your TCP window size), then we start dropping entries to you and you see:
<sorryTooSlow youMissed="23" />
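For readers wondering how that drop policy might work, here is a rough Python sketch. It is not the prototype's actual code: the ClientBuffer class, its method names, and the choice to count the notice itself against the cap are all assumptions made for illustration. Only the 256k limit and the sorryTooSlow element come from the post.

```python
from collections import deque


class ClientBuffer:
    """Hypothetical per-client outgoing queue with a byte cap."""

    def __init__(self, max_bytes=256 * 1024):
        self.max_bytes = max_bytes
        self.queue = deque()
        self.size = 0      # bytes currently queued for this client
        self.missed = 0    # entries dropped since the last notice

    def _enqueue(self, data):
        self.queue.append(data)
        self.size += len(data)

    def push(self, entry):
        """Queue an entry, or drop it and count it if the client is too far behind."""
        notice = b""
        if self.missed:
            notice = b'<sorryTooSlow youMissed="%d" />\n' % self.missed
        if self.size + len(notice) + len(entry) > self.max_bytes:
            self.missed += 1
            return False
        if notice:
            # The client has caught up enough: say how much was dropped first.
            self._enqueue(notice)
            self.missed = 0
        self._enqueue(entry)
        return True

    def pop(self):
        """Take the next chunk to write to the client's socket."""
        chunk = self.queue.popleft()
        self.size -= len(chunk)
        return chunk
```

The point of the design is that a slow reader only costs the server a bounded amount of memory; everything past the cap is counted rather than stored.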
I think soon we'll get TypePad and perhaps MovableType blogs all being sent through this. The final home will probably be on a subdomain of sixapart.com somewhere, including documentation better than this blog entry.
And yes, I'm sure my Atom syntax is bogus or something. I spent a good 2 minutes on that part of it.
I don't know what I'm doing wrong, but I've written some python code that looks at this, and it seems like there are way more entries than LiveJournal's front page is saying: as in, about 1300 entries per minute rather than just 200 that it says right now.
[crschmidt@creusa ~]$ python test.py
Entries: 100. Entries/second: 19.2611058701. Time: 1124282733
Entries: 200. Entries/second: 22.5625200497. Time: 1124282737
http://crschmidt.net/python/ljentries.py
is the code: am I really insane? did I do something wrong? I don't see obvious dupes in my code, and although I'm assuming I'd miss entries (if they broke over a 1024 barrier, since I'm not looking at the buffer) I don't think that I can think of a way I'd get extras...
Added in some extra collision checking, and found out that there are indeed clashes, but I can't seem to figure out why/where they're coming from. All my code is in that Python link up there.
*shrug* No clue what's up, but thought you might want to know. It almost seems like you're grabbing a full set of 100 new URLs from the cache every 10 seconds or so, and not checking if they're already printed out somehow... but that doesn't make any sense at all. So it's probably my code, but I can't figure out how.
Okay, so I just checked it in telnet, and I'm definitely seeing the same entries over and over. So, ignore my earlier comment.
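Both worries in this thread (entries that straddle a 1024-byte read, and entries that show up more than once) can be handled on the client side. The following is only a sketch and is not the code at the link above: iter_feeds and unique are made-up names, and it assumes each document in the stream ends with a literal </feed> closing tag.

```python
def iter_feeds(chunks, end_tag=b"</feed>"):
    """Yield complete documents from an iterable of raw byte chunks.

    Leftover bytes are carried over to the next chunk, so a document
    split across two reads is reassembled instead of being lost.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            idx = buf.find(end_tag)
            if idx == -1:
                break  # no complete document yet; wait for more data
            cut = idx + len(end_tag)
            yield buf[:cut].strip()
            buf = buf[cut:]


def unique(feeds):
    """Skip documents that are byte-for-byte repeats of ones already seen."""
    seen = set()
    for feed in feeds:
        if feed in seen:
            continue
        seen.add(feed)
        yield feed
```

Something like unique(iter_feeds(chunks)), with chunks coming from repeated socket reads, would then count each distinct document once. The seen set grows without bound, so a long-running reader would want to expire old items.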
2005-08-17 09:37 pm (UTC)
My ghetto injector is surely at fault. I'll fix soon here. This was just a prototype, after all.
I ran the stream through "grep sorry", curious whether I would get that sorryTooSlow tag. Of course, this shows me all the post lines with "sorry" in them; apparently LJ users are an apologetic bunch. However, I noticed that I saw the same posts over and over. Does anybody else notice this?
2005-08-17 09:29 pm (UTC)
That's kind of neat. I grabbed a 300M snapshot, wrote 10 lines of python to break it into pieces, and fed it to a metacarta appliance, to see if it keeps up :-) I'll let you know if I find anything "interestingly geographic" out of it...
2005-08-17 11:35 pm (UTC)
"bogus or something"
According to the current Atom spec, the <feed> is supposed to contain an <id> with the URI of the feed (rather than just a <link>) and an <updated> as well; these two, plus the <title>, are the only mandatory bits of <feed>.
The same three elements are the only mandatory bits of <entry>. The standard LJ .../data/atom Atom feeds seem to use some hokey URN for the entry id, but there's no need for that --- LJ posts have perfectly valid, dereferenceable perma-URLs, which are presently in the <link> elements in this stream.
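A quick way to check a document against the three mandatory elements described above, sketched in Python. The missing_required name is made up, and the code assumes the elements are in the Atom 1.0 namespace (http://www.w3.org/2005/Atom), which the prototype stream may or may not use.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"   # assumed namespace
REQUIRED = ("id", "title", "updated")


def missing_required(xml_text):
    """Return the mandatory elements absent from a feed and its entries."""
    feed = ET.fromstring(xml_text)
    problems = []
    for name in REQUIRED:
        if feed.find(ATOM + name) is None:
            problems.append("feed/" + name)
    for i, entry in enumerate(feed.findall(ATOM + "entry")):
        for name in REQUIRED:
            if entry.find(ATOM + name) is None:
                problems.append("entry[%d]/%s" % (i, name))
    return problems
```

Run on a feed that carries only a <title> and a <link> at both levels, it would report the missing id and updated elements for the feed and for the entry.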
I'm no Atom expert, so I hope this is helpful. Presumably either you will change the stream format to be valid Atom, or people who want to feed these things to their current Atom-consuming software will have to transform your current format into valid Atom --- what do you plan to do? I'd hate to put effort into doing the mapping myself if you're about to fix it.
2005-08-17 11:37 pm (UTC)
Re: "bogus or something"
The Atom LJ is injecting is total absolute crap. It's just a demo.
TypePad's injecting real Atom and LJ's about to start.
All I see is blonde, brunette, redhead...
2005-08-21 11:52 pm (UTC)
Just stream Atom files. You don't need to define a new format.
As you know, since I asked for this over a year ago, I think your proposal to push streams of LiveJournal updates is wonderful! At PubSub, we currently read the LiveJournal latest.bml file as frequently as once a minute, if not more frequently. The result has been that we're able to keep up to date on LiveJournal posts with massively less bandwidth and processing cost than with any other blogging system. Thanks for moving in the direction of streaming the updates.
While I like what you're doing, I must admit that I am not terribly excited by the *way* you are proposing to do it. I believe that what you want to do can be done by streaming a feed of atom:entry's which contain atom:source elements carrying the feed metadata. The atom:source element was explicitly designed for exactly the kind of application you are proposing (i.e., the generation of an aggregate feed). I've discussed your proposal on both the FeedMesh and Atom-Syntax lists. Please take a look at the alternatives I propose in these messages:
http://www.imc.org/atom-syntax/mail-archive/msg16794.html
http://groups.yahoo.com/group/feedmesh/message/451
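To make the atom:source idea concrete, here is a small Python sketch that builds one atom:entry whose atom:source child carries the originating feed's metadata. The function name, the dictionary layout, and the example.com URLs in the usage note are placeholders, not anything taken from the linked proposals.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def entry_with_source(feed_meta, entry_meta):
    """Serialize one atom:entry that embeds its feed's metadata in atom:source.

    Both arguments are dicts with "id", "title" and "updated" keys
    (a hypothetical layout chosen for this sketch).
    """
    ET.register_namespace("", ATOM_NS)   # emit Atom as the default namespace

    def q(tag):
        return "{%s}%s" % (ATOM_NS, tag)

    entry = ET.Element(q("entry"))
    for tag in ("id", "title", "updated"):
        ET.SubElement(entry, q(tag)).text = entry_meta[tag]
    # atom:source holds the metadata of the feed the entry came from,
    # so each entry in an aggregate stream remains self-describing.
    source = ET.SubElement(entry, q("source"))
    for tag in ("id", "title", "updated"):
        ET.SubElement(source, q(tag)).text = feed_meta[tag]
    return ET.tostring(entry, encoding="unicode")
```

For example, entry_with_source({"id": "http://example.com/", "title": "Example journal", "updated": "2005-08-21T23:52:00Z"}, {"id": "http://example.com/1.html", "title": "A post", "updated": "2005-08-21T23:52:00Z"}) returns a single entry element that a consumer can process without having seen the enclosing feed.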
Thanks again for considering push feeds of the LiveJournal updates. My hope is that other blog hosting platforms will follow your lead and implement similar feeds as soon as possible. Hosters who stream feeds in this manner will, I think, be seen to be serving the needs of their users much better, since it will be much more likely that users' posts will get into search and/or monitoring systems. Also, the bandwidth and processing impact of having services consume the stream of updates will be much less than what is required if they are all fetching raw RSS/Atom files. This means that a greater proportion of the bandwidth and processing power you consume can be dedicated to providing your users with new and/or faster services. A win for everyone!
2005-08-22 12:13 am (UTC)
Less White-Space please...
It would be very nice if you were to remove some of the unnecessary white-space from the feed. A feed like this can only be usefully read by a machine I think. We'd all save a great deal of bandwidth if we didn't have to pay for the white-space. (Yes, this is a very small point...)