Log in

Never ending feed of Atom feeds - brad's life [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

Never ending feed of Atom feeds [Aug. 16th, 2005|12:58 pm]
Brad Fitzpatrick
[Tags|, , , ]

An increasing number of companies (large and small) are really insistent that we ping them with all blog updates, for reasons I won't rant about.

Just to prove a point, I flooded a couple of them and found that sure enough, nobody can really keep up. It's even more annoying when they don't even support persistent HTTP connections.

So --- I decided to turn things on their head and make them get data from us. If they can't keep up, it's their loss.

Prototype: (not its final home)

$ telnet danga.com 8081
GET /atom-stream.xml HTTP/1.0<enter>

And enjoy the never ending XML stream of Atom feeds, each containing one entry. And if you get more than 256k behind (not including your TCP window size), then we start dropping entries to you and you see:

<sorryTooSlow youMissed="23" />

I think soon we'll get TypePad and perhaps MovableType blogs all being sent through this. The final home will probably be on a subdomain of sixapart.com somewhere, including documentation better than this blog entry.

And yes, I'm sure my Atom syntax is bogus or something. I spent a good 2 minutes on that part of it.

[User Picture]From: crschmidt
2005-08-17 12:57 pm (UTC)
I don't know what I'm doing wrong, but I've written some python code that looks at this, and it seems like there are way more entries than LiveJournal's front page is saying: as in, about 1300 entries per minute rather than just 200 that it says right now.

[crschmidt@creusa ~]$ python test.py
Entries: 100. Entries/second: 19.2611058701. Time: 1124282733
Entries: 200. Entries/second: 22.5625200497. Time: 1124282737

http://crschmidt.net/python/ljentries.py is the code: am I really insane? did I do something wrong? I don't see obvious dupes in my code, and although I'm assuming I'd miss entries (if they broke over a 1024 barrier, since I'm not looking at the buffer) I don't think that I can think of a way I'd get extras...

Added in some extra collision checking, and found out that there are indeed clashes, but I can't seem to figure out why/where they're coming from. All my code is in that Python link up there.

*shrug* No clue what's up, but thought you might want to know. It almost seems like you're grabbing a full set of 100 new URLs from the cache every 10 seconds or so, and not checking if they're already printed out somehow... but that doesn't make any sense at all. So it's probably my code, but I can't figure out how.
(Reply) (Thread)