2008-07-07 10:38 pm (UTC)
Simpler than JSON or a different usage requirement?
2008-07-07 11:19 pm (UTC)
Why not just use Thrift? It already has Perl support, is similarly open source, is both a DDL and an IDL, features binary encoding, and is supported by a large company (Facebook, with Apache incubation).
2008-07-07 11:14 pm (UTC)
2008-07-08 12:52 am (UTC)
We ended up reinventing this wheel at Danger too. It's nice to have an implementation of this with Google-cred behind it. That way nobody ever has to rewrite it again :)
find protobuf-2.0.0beta -type f -print0 | xargs -0 fgrep Brad
finds no hits, but Will Robinson's name appears often. Is there a Google protocol for when an employee should sign his work?
And next, I get Perl readability! ;-)
Thank you for saying you'll work to support Perl. I've been discussing it with a Google employee friend of mine and he said I should port it. I think you'd do a better job. Make us proud!
Interesting tradeoff in "Embedded messages" - you can read and write a stream of data without buffering it, but if you have a message inside the stream, you must buffer the entire message before writing it so you can measure its length.
I'm not sure that's the right choice - I'd rather do what SPKI does, measure all the atoms explicitly and then add grouping symbols. That also means you don't have to muck about with 7-bit encoding of integers, which I particularly care about because I might like to send some very big integers :-)
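For reference, a rough Python sketch of the scheme under discussion, as I read the protobuf encoding docs: base-128 varints, ZigZag for signed values, and a length prefix on embedded messages. The function names are made up for illustration and this is not the library's API. The point to notice is that `write_embedded` needs the inner message as finished bytes before it can emit the length.

```python
import io

def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("this sketch only encodes unsigned values")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)

def zigzag64(n):
    """Interleave signed values so small negatives stay short: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)

def write_embedded(stream, field_number, inner_bytes):
    """Write a length-delimited field (wire type 2).

    The inner message must already be fully serialised, because its
    length is written before its bytes.
    """
    stream.write(encode_varint((field_number << 3) | 2))
    stream.write(encode_varint(len(inner_bytes)))
    stream.write(inner_bytes)

# Inner message: field 1, wire type 0 (varint), value 150
inner = encode_varint((1 << 3) | 0) + encode_varint(150)
out = io.BytesIO()
write_embedded(out, 3, inner)
encoded = out.getvalue()
print(encoded.hex())
```

Everything outside the embedded message can be streamed straight out; only `inner` has to sit in memory first.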
Here's what I'd do.
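def toUint(data):
    # Unsigned integer from raw bytes; big-endian byte order is an assumption
    return int.from_bytes(data, "big")

def readIntOrLen(lim, stream, i):
    # Values below lim are stored directly in the nibble
    if i < lim:
        return i
    # Otherwise the nibble says how many bytes follow: 1 + i - lim
    # Could force eg one-byte encoding to cover range [lim, lim+255] instead of [0, 255] here but is it worth it?
    return toUint(stream.read(1 + i - lim))

def readItem(stream):
    # specialSymbol is left undefined here; fieldId 0 is reserved for it
    firstByte = toUint(stream.read(1))
    fieldId = readIntOrLen(12, stream, firstByte >> 4)
    if fieldId == 0:
        return specialSymbol(firstByte & 0x0f)
    dataLen = readIntOrLen(8, stream, firstByte & 0x0f)
    return fieldId, stream.read(dataLen)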
There you go: short encoding when values are small, no special handling of different types, no need for seven-bit encoded integers or "ZigZag" encoding or any such weirdness, and room for 16 special symbols eg for grouping.
Edited at 2008-07-08 03:32 pm (UTC)
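A writer to go with the parser above might look something like this. It is only a sketch: the snake_case names, the big-endian byte order, and the `("special", n)` tuple standing in for `specialSymbol` are all my assumptions, and the reader is repeated so the block runs on its own.

```python
import io

def to_uint(data):
    # Big-endian is an assumption; the original does not specify byte order
    return int.from_bytes(data, "big")

def read_int_or_len(lim, stream, i):
    # Nibble values below lim are the value itself; otherwise
    # the nibble gives the number of bytes that follow (1 + i - lim)
    if i < lim:
        return i
    return to_uint(stream.read(1 + i - lim))

def read_item(stream):
    first = to_uint(stream.read(1))
    field_id = read_int_or_len(12, stream, first >> 4)
    if field_id == 0:
        return ("special", first & 0x0F)  # stand-in for specialSymbol()
    data_len = read_int_or_len(8, stream, first & 0x0F)
    return field_id, stream.read(data_len)

def encode_int_or_len(lim, value):
    """Return (nibble, extra_bytes), always picking the shortest form."""
    if value < lim:
        return value, b""
    n_bytes = (value.bit_length() + 7) // 8
    nibble = lim + n_bytes - 1
    if nibble > 15:
        raise ValueError("value does not fit in this nibble's range")
    return nibble, value.to_bytes(n_bytes, "big")

def write_item(stream, field_id, data):
    if field_id < 1:
        raise ValueError("field id 0 is reserved for special symbols")
    id_nibble, id_extra = encode_int_or_len(12, field_id)
    len_nibble, len_extra = encode_int_or_len(8, len(data))
    header = bytes([(id_nibble << 4) | len_nibble])
    stream.write(header + id_extra + len_extra + data)

# Round trip: one item that fits in the header byte, one that needs extra bytes
buf = io.BytesIO()
write_item(buf, 5, b"hi")
write_item(buf, 300, b"x" * 20)
buf.seek(0)
print(read_item(buf))
print(read_item(buf))
```

The first item costs a single header byte; the second needs two extra bytes for the field id and one for the length.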
I always look forward to your feedback when it comes to technical posts like this. Just wanted you to know.
Hey, thanks! I'm glad it makes sense - I tried to express what I was thinking in English, but it got really clumsy, so I thought just writing the Python for the parser would be easier.
While I'm at it, here's another couple of ideas on how to do this right:
- specialSymbol(0) is NOP. That's the same as a zero byte. So if it's convenient for some reason - eg if you're writing into a fixed-length file or some such - you can have runs of zeroes between data and they will be ignored.
- specialSymbol(1) means "start group" and specialSymbol(2) means "end group". "start group" will be followed by a groupId, in readIntOrLen(252, stream, toUint(stream.read(1))) form. Then a bunch of data items, then an "end group" symbol.
- specialSymbol(3) means "what follows is just a chunk" and specialSymbol(4) means "this is the last chunk". With these you can start writing to a value before you know the length of it, in a chunked encoding. Note that a chunk can be zero length, just in case you only discover you've come to the end of the data after you've written the last byte.
- specialSymbol(5) means "what follows is metadata". This will be followed by a group; the groupId will specify what kind of metadata it is. If the parser doesn't recognise the groupId of the metadata, it should ignore it.
- The parser must fail if it encounters a specialSymbol it doesn't understand; specialSymbol(5) should be used where information can be safely ignored.
- Where you need a "canonical encoding" for digital signature purposes, use the shortest possible encoding; there's only one such encoding.
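Roughly how a parser loop could treat those symbols, as a sketch only. It handles NOP, start/end group and metadata; since it recognises no metadata groupIds at all, every metadata group is dropped. Chunked values (symbols 3 and 4) are left out because the layout isn't pinned down above, so this sketch counts them as "not understood" and fails. All names and the big-endian byte order are assumptions.

```python
import io

NOP, START_GROUP, END_GROUP, METADATA = 0, 1, 2, 5

def to_uint(data):
    return int.from_bytes(data, "big")  # byte order is an assumption

def read_int_or_len(lim, stream, i):
    if i < lim:
        return i
    return to_uint(stream.read(1 + i - lim))

def parse_items(stream, in_group=False):
    """Return a list of (fieldId, data) and ("group", groupId, children) entries."""
    items = []
    skip_next_group = False
    while True:
        head = stream.read(1)
        if not head:
            if in_group:
                raise ValueError("stream ended inside a group")
            return items
        first = head[0]
        field_id = read_int_or_len(12, stream, first >> 4)
        if field_id != 0:
            data_len = read_int_or_len(8, stream, first & 0x0F)
            items.append((field_id, stream.read(data_len)))
            continue
        symbol = first & 0x0F
        if symbol == NOP:
            continue  # runs of zero bytes are ignored
        elif symbol == START_GROUP:
            group_id = read_int_or_len(252, stream, to_uint(stream.read(1)))
            children = parse_items(stream, in_group=True)
            if skip_next_group:
                skip_next_group = False  # unrecognised metadata: ignore it
            else:
                items.append(("group", group_id, children))
        elif symbol == END_GROUP:
            if not in_group:
                raise ValueError("end group without a start group")
            return items
        elif symbol == METADATA:
            skip_next_group = True
        else:
            raise ValueError(f"special symbol {symbol} not understood")

data = (b"\x00\x00"                   # two NOPs
        b"\x01\x07" b"\x52hi" b"\x02"  # group 7 holding field 5 = b"hi"
        b"\x05\x01\x09\x31z\x02"       # metadata group 9, ignored
        b"\x00")                       # trailing NOP
print(parse_items(io.BytesIO(data)))
```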
NEWSFLASH! Google reinvents AOL's SNAC/TLV. How long until they reimplement AOL's FLAP?
I guess this means Google's finally hired a bunch of ex-AOL'ers to work on stuff. Now, if Twitter could only hire a few ...
2008-07-09 05:34 pm (UTC)
Not only are these sorts of wire formats very common (see above comments about Danger's as well as Facebook's), and not only does this format date back nearly a decade (which doesn't necessarily mean it predates TLV, but certainly means it predates any AOL hires), but most importantly a TLV is not even comparable to these formats: TLV uses four bytes per type+len field with an opaque value, while protos are all about efficient bitpacking of integers and structs. Did you even read the documentation?
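To put rough numbers on the size difference, here is a small sketch. The fixed header layout (2-byte type plus 2-byte length, big-endian) is my assumption of what "four bytes per type+len" means, the protobuf side follows the published encoding docs, and the function names are invented for illustration.

```python
import struct

def encode_varint(n):
    """Base-128 varint for an unsigned integer."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def fixed_tlv(type_id, value):
    # Assumed layout: 2-byte type + 2-byte length, then the opaque value
    return struct.pack(">HH", type_id, len(value)) + value

def proto_varint_field(field_number, n):
    # Key is (field_number << 3) | wire_type, with wire type 0 for varints
    return encode_varint((field_number << 3) | 0) + encode_varint(n)

def proto_bytes_field(field_number, value):
    # Wire type 2: key, varint length, then the bytes
    return encode_varint((field_number << 3) | 2) + encode_varint(len(value)) + value

# The integer 5 in field 1, stored as a 4-byte value on the TLV side
print(len(fixed_tlv(1, struct.pack(">I", 5))), len(proto_varint_field(1, 5)))
# The two-byte string "hi" in field 1
print(len(fixed_tlv(1, b"hi")), len(proto_bytes_field(1, b"hi")))
```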
2008-07-08 06:57 pm (UTC)
Let me just summarise much of the discussion I've seen on this so far: "why did you not use my favourite (possibly niche) XDR" or "You don't like XML? You're a poopy-head!"
I think the generic response (and the FAQ suggestion to evaluate the choice of XDR/IDL in the context of what you're actually doing yourself) should really be, "We use this shit at Google, and it works, bitches!"
(I do not mean to suggest that there is anything wrong with ciphergoth's suggestion, since that is both complimentary to the ProtoBuf stuff and constructive criticism)