C#: Strings without encodings? Working with buffers. - brad's life
Brad Fitzpatrick

C#: Strings without encodings? Working with buffers. [Dec. 21st, 2003|06:52 pm]
Brad Fitzpatrick
So far I'm really liking most of C#. Here's something I can't figure out, though:

Question for HTTP wizards
Writing an HTTP header parser in C# while still being 8-bit-clean looks to be a bitch. Or are headers always ASCII? RFC 2616's grammar says octet all over, which I can only assume means all 8 bits in "oct". Or is that just ideal, and in the real world so many servers suck that clients only send 7 bit headers? Or, is there an explicit encoding in HTTP? UTF-8?

The Mono XSP webserver isn't 8-bit clean. It assumes all is ASCII.

Question for C# wizards
Is there any way to conveniently operate on a buffer of bytes, perhaps as a String, without knowing its encoding? If I have a buffer with bytes over 127, Encoding.ASCII converts them to question marks (which I verified by going from byte buffer to String and back again). This is what Mono's XSP does.
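
For anyone who wants to reproduce that round trip, here is a minimal sketch. The names RoundTripDemo and RoundTrip are made up for illustration; Encoding.ASCII, GetString, and GetBytes are the standard System.Text calls.

```csharp
using System.Text;

public static class RoundTripDemo
{
    // Decode the bytes to a string with the given encoding, then encode
    // the string back to bytes with the same encoding.
    public static byte[] RoundTrip(byte[] input, Encoding encoding)
    {
        string text = encoding.GetString(input);
        return encoding.GetBytes(text);
    }
}
```

Calling RoundTripDemo.RoundTrip(new byte[] { 0x41, 0xE9 }, Encoding.ASCII) should hand back 0x41, 0x3F: the 0xE9 byte has no ASCII meaning, so it comes out as a question mark.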

And you can only do regular expressions on a string, not a buffer?

If I have anonymous 8-bit data, I should still be able to split on it, search for substrings (read: other byte arrays), and run regexps on it.

Now, I think it's nice that Strings have known encoding, but I think C# should have a more powerful System.Buffer class that allows for more than just getting/setting bytes. I should be able to do searches from another byte array. And the RegExp library should allow matching on byte buffers.
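
A naive sketch of the kind of byte-array search meant here, written by hand and not tuned for speed. ByteSearch and IndexOf are hypothetical names, not framework APIs, and the code assumes start is not negative.

```csharp
public static class ByteSearch
{
    // Return the index of the first occurrence of needle in haystack at or
    // after start, or -1 if there is none. Plain O(n*m) comparison.
    public static int IndexOf(byte[] haystack, byte[] needle, int start = 0)
    {
        if (needle.Length == 0)
        {
            return start <= haystack.Length ? start : -1;
        }
        for (int i = start; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Splitting a request on the blank line between headers and body then comes down to searching for the four bytes 0x0D, 0x0A, 0x0D, 0x0A, with no encoding involved at all.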

From: jwz
2003-12-21 07:36 pm (UTC)
In RFCs, "octet" generally refers to the fact that there are 8 bits per byte, without saying anything about how those bytes are interpreted.

Are there any HTTP headers that you care about where there are human-readable strings involved? If not, it doesn't matter, right?

The fact that I wasn't able to use the various String classes, and had to write my own "network buffer" class to do protocol-ish stuff was one of my earliest gripes with Java, too.
From: brad
2003-12-21 07:49 pm (UTC)
See ch's comment below. HTTP headers are all ASCII.

To be fair to C#, it's much less a pain in the ass than Java. And you can just do an unsafe { ... } block anywhere and use pointers, if it comes down to it.
From: ch
2003-12-21 07:42 pm (UTC)
rfc2616 Section 4.2:

HTTP header fields ... follow the same generic format as that given in Section 3.1 of RFC 822 [9].

rfc 822:

The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF.

Don't forget to unfold each header if needed.
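
Unfolding could look roughly like this. It's a sketch that assumes the header block has already been decoded to a string and uses CRLF line endings; HeaderUnfolder and Unfold are illustrative names. It collapses each fold to a single space, which RFC 2616 permits for linear whitespace.

```csharp
using System.Text.RegularExpressions;

public static class HeaderUnfolder
{
    // A CRLF followed by at least one space or tab marks a continuation
    // line. Replace the whole fold with one space so that every header
    // ends up on a single line.
    public static string Unfold(string headerBlock)
    {
        return Regex.Replace(headerBlock, @"\r\n[ \t]+", " ");
    }
}
```

A CRLF that is followed by anything other than a space or tab is left alone, so the boundaries between separate headers survive.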
From: brad
2003-12-21 07:46 pm (UTC)
Ah, wonderful!

I saw references in rfc2616 to "MIME-like" all over, and I'd seen rfc822's "ASCII-only" bit you posted, but not that section 4.2 you quoted. Thanks!
From: banana
2003-12-22 04:50 am (UTC)
Also from rfc2616 s4.2:
field-content = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>
...and from s2.2:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
...so (=?ISO-8859-1?Q?a?=) is equivalent to (a)
From: taral
2003-12-22 09:28 am (UTC)
Sorry, that's wrong. The EBNF is quite specific: the only constraint on headers is that they be TEXT = any OCTET except CTLs, but including LWS. Specifically, quoted-string and entity-tag use this.

However, I don't see how using anything above ASCII is useful, since no encoding is defined. Wait...

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047.

There you go. So it's ISO-8859-1.
From: scosol
2003-12-21 09:07 pm (UTC)
HTTP headers are clearly defined; I would think that you'd be perfectly safe leaving out anything that you couldn't "read".
From: scosol
2003-12-21 09:10 pm (UTC)
note to self: read comments before commenting :p
From: jope
2003-12-22 01:39 am (UTC)
There's what the headers are supposed to be, and there's what might get thrown at you. Unless this has a proxy in front of it (and maybe even then, depending on the proxy), be prepared for non-ASCII -- and in particular null-bytes -- from attackers.
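
One way to be prepared is to check the raw bytes before any string decoding happens. This is only a sketch of one possible policy (7-bit ASCII, no control bytes except CR, LF, and TAB), and HeaderGuard and LooksSafe are made-up names; a real server might want to be stricter or more lenient.

```csharp
public static class HeaderGuard
{
    // True if every byte is printable 7-bit ASCII, or one of CR, LF, TAB.
    // NUL bytes, other control bytes, DEL, and anything over 0x7E fail.
    public static bool LooksSafe(byte[] rawHeaders)
    {
        foreach (byte b in rawHeaders)
        {
            bool allowedControl = b == 0x0D || b == 0x0A || b == 0x09;
            if (b > 0x7E || (b < 0x20 && !allowedControl))
            {
                return false;
            }
        }
        return true;
    }
}
```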
From: brad
2003-12-22 01:52 am (UTC)
With LiveJournal's traffic and history of attacks, I'm being paranoid as hell.

Writing this in C# is step #1 of paranoia.
From: jope
2003-12-22 06:40 pm (UTC)
Are there C# bindings for PCRE? I looked (not too hard though), but didn't find any. I know it operates on buffers (rather than strings) and can handle embedded nulls, so it's worked great for our needs thus far.
From: brad
2003-12-22 07:02 pm (UTC)
Perl-style regexps are built in to C#. Or rather, they're part of the standard framework. There's no language-level support for them, so they're somewhat hard to use... although there are constructs to write regexps more easily without escaping all the backslashes, and that's language-level, so it was definitely planned for. There are even a few cool additions. You can name matches: instead of the default $1, $2, $3, etc., you can do (?<name>\S+) and get at that grouping by Match.Groups["name"] instead of Groups[1] ($1).
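
A small sketch of named groups together with a verbatim (@"...") string, which is the construct that avoids doubling the backslashes. The HeaderLine class, the TryParse method, and the pattern itself are illustrative assumptions, not a complete header grammar.

```csharp
using System.Text.RegularExpressions;

public static class HeaderLine
{
    // Verbatim string: \s reaches the regex engine without extra escaping.
    private static readonly Regex Pattern =
        new Regex(@"^(?<name>[^:\s]+):\s*(?<value>.*)$");

    // Split "Name: value" into its two parts using the named groups.
    public static bool TryParse(string line, out string name, out string value)
    {
        Match m = Pattern.Match(line);
        name = m.Success ? m.Groups["name"].Value : null;
        value = m.Success ? m.Groups["value"].Value : null;
        return m.Success;
    }
}
```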

And I figured out the trick to work on buffers: just convert a byte buffer to a string using any simple charset with high ASCII (say, ISO-8859-1), work on it with a regexp (which is probably pure ASCII anyway), then convert it back to bytes using the same charset.
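
Roughly, the trick looks like this. It's a sketch: BinaryRegex and Replace are hypothetical names, while Encoding.GetEncoding("ISO-8859-1") and Regex.Replace are the real framework calls. It assumes the replacement text stays within U+0000 to U+00FF, since anything higher would not survive the conversion back to bytes.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class BinaryRegex
{
    // ISO-8859-1 maps each byte 0x00-0xFF to the character with the same
    // code point, so decoding and re-encoding loses nothing.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Run a regex replacement over arbitrary bytes by way of a Latin-1 string.
    public static byte[] Replace(byte[] input, string pattern, string replacement)
    {
        string text = Latin1.GetString(input);
        string result = Regex.Replace(text, pattern, replacement);
        return Latin1.GetBytes(result);
    }
}
```

One thing to watch: classes like \w and \s are Unicode-aware in .NET, so they can match bytes above 127 once those bytes have become Latin-1 characters. Explicit classes such as [A-Za-z0-9] keep the pattern pure ASCII.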

So in the end, I'm happy.
From: mart
2003-12-22 03:36 am (UTC)

It always seems weird to use regexes when they aren't an intrinsic part of the language. I always feel like the regex library is intended for allowing users of the application to use regexes rather than the programmer.

It'd be much cooler if the compiler could translate the regex into real code at compile time rather than just including a string for the library to deal with at runtime. Oh well.

From: taral
2003-12-22 09:34 am (UTC)
As for binary strings, the answer is yes. Write a new Encoding/Encoder/Decoder class set that represents "application/octet-stream". :)