BART and UTF-8 [Jun. 4th, 2005|07:50 pm]
Waiting for a BART train this afternoon, I was looking at the LED reader board to see when the next one was coming. Some advertisement about an upcoming concert came on, but at one point where there was supposed to be punctuation (probably an EM DASH, U+2014), there were instead two ISO-8859-1 characters (some vowel with a diacritic and something else).

Theory: somebody wrote up the advertisement in MS-Word (which auto-corrected a double-hyphen into an em-dash), copy/pasted it into a web form (with the paste preserving the Unicode character), submitted it to a website that lets you add advertisements to BART reader boards, the browser converted the Unicode in the textarea to utf-8... then the stupid cgi script blindly took it and passed it off to the reader board, which only knows iso_8859-1/Latin-1, and not utf-8, because it was built so long ago.
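
The theory is easy enough to reproduce. Here's a minimal Python sketch, assuming the sign's controller treats every incoming byte as one Latin-1 character (that part is a guess, not anything known about the BART hardware):

```python
# Sketch of the suspected failure: UTF-8 bytes displayed as if they were Latin-1.
em_dash = "\u2014"                       # EM DASH, what Word likely produced

utf8_bytes = em_dash.encode("utf-8")     # what the browser would submit
garbled = utf8_bytes.decode("latin-1")   # assumption: the sign maps one byte to one glyph

print(" ".join(f"{b:02x}" for b in utf8_bytes))   # e2 80 94
print([f"U+{ord(c):04X}" for c in garbled])       # ['U+00E2', 'U+0080', 'U+0094']
```

U+00E2 is â, which fits the "vowel with diacritic". The other two bytes land in Latin-1's C1 control range, so what a given sign draws for them (one glyph, two, or nothing) is anyone's guess.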

But I stilllllll love tech-nol-ogyyyyyyy....
... but noooooooot as much as-you-you-see.....

From: brad
2005-06-05 03:05 am (UTC)
Neither MS Word nor the browser is to blame here. Word properly auto-converted the punctuation and put Unicode data in the copy/paste buffer. The browser properly put the Unicode paste results in the textarea, then properly converted them to utf-8, per standards/conventions. (You submit to the server in the same charset the document was sent to you in....)

The problem rests entirely with the web app that took 8-bit data blindly without transcoding it.
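
For illustration, the missing transcoding step might look something like this in Python. It's only a sketch: `to_latin1` and the `FALLBACKS` table are made-up names, and it assumes the form data really does arrive as utf-8 and that the board wants ISO-8859-1.

```python
# Hypothetical substitutions for common "smart" punctuation that has no
# Latin-1 equivalent; the table is illustrative, not exhaustive.
FALLBACKS = {
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u2026": "...",  # ellipsis
}


def to_latin1(raw: bytes) -> bytes:
    """Decode submitted utf-8 bytes and re-encode them as ISO-8859-1."""
    text = raw.decode("utf-8")          # raises UnicodeDecodeError on bad input
    for src, dst in FALLBACKS.items():
        text = text.replace(src, dst)
    # Anything still outside Latin-1 becomes "?" instead of garbage bytes.
    return text.encode("latin-1", errors="replace")


print(to_latin1("Live \u2014 tonight".encode("utf-8")))   # b'Live -- tonight'
```

Characters that Latin-1 does have, like é, come through as the single byte the board expects.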
From: xaosenkosmos
2005-06-05 03:35 am (UTC)
A lot of forms wind up as ISO-8859-1 instead of UTF-8, since Apache "helpfully" appends a charset declaration to your Content-Type if you omit it. Browsers apparently trust Apache more than the page's author: An example, the source. You can pass in whatever you'd like for enc and str, to see how things go.
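
One way around it is to not leave the charset up to the server at all and put it in the Content-Type header yourself. A rough Python WSGI sketch (the `app` function and `FORM_HTML` page are made up for illustration, and this is not the example linked above):

```python
# Hypothetical form page; the meta tag alone may lose out to the HTTP header.
FORM_HTML = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="utf-8"><title>Ad form</title></head>\n'
    '<body><form method="post"><textarea name="str"></textarea>'
    '<input type="submit"></form></body></html>\n'
)


def app(environ, start_response):
    """Serve the form with an explicit charset in the HTTP header."""
    body = FORM_HTML.encode("utf-8")
    start_response("200 OK", [
        # Stating the charset here leaves no gap for a server default to fill.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library should do.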

This problem motivated HEBCI, which should solve the problem once i get a complete implementation.
From: brad
2005-06-05 03:58 am (UTC)
HEBCI: awesome!
From: phil
2005-06-05 03:07 am (UTC)
Don't be jealous that I've been chatting online with babes all day.
From: valiskeogh
2005-06-05 03:32 am (UTC)
From: pyesetz
2005-06-05 04:53 am (UTC)

You think *that's* bad

This example is worse, being embossed in plastic rather than flashed momentarily on a display screen.
From: brad
2005-06-05 05:12 am (UTC)

Re: You think *that's* bad

From: muerte
2005-06-05 05:10 am (UTC)
Mad props for working Napoleon Dynamite into a post about reader boards.
From: edm
2005-06-05 05:47 am (UTC)
My favourites are when it's been through this UTF-8 pretending to be ISO-8859-1 (Latin1) more than once. One source (Unicode) character becomes three or four pseudo-Latin1 characters. Some of them seem to crop up so often that I can look at the sequence and say "oh, that's supposed to be a 'smart opening quote'" or whatever.
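
A quick Python sketch of how the pile-up happens, assuming each pass encodes the text as UTF-8 and then misreads the bytes as Latin-1 (`misread` is just a helper name made up for this):

```python
def misread(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


for name, ch in [("e-acute", "\u00e9"), ("smart opening quote", "\u201c")]:
    once = misread(ch)
    twice = misread(once)
    print(name, len(once), len(twice))
# e-acute 2 4
# smart opening quote 3 6

# Each bad pass can be undone by reversing it, as long as no bytes were lost.
repaired = misread("\u201c").encode("latin-1").decode("utf-8")
```

So three characters is a three-byte character mangled once, and four is a two-byte character mangled twice.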

I even see them from time to time on feeds syndicated onto LiveJournal (eg, Lawrence Lessig) although I suspect it's not LJ causing it, and I see it in other places far more often.

But I think you get extra points for seeing it on a scrolling sign...

From: jwz
2005-06-05 09:08 am (UTC)
I'm kinda surprised the LED sign does anything beyond ASCII...
From: brad
2005-06-05 09:15 am (UTC)
From: uke
2005-06-05 06:08 pm (UTC)
On a mildly-related note: the signs outside BART stations which display elevator outages appear to be controlled via IRC. There are occasionally connection errors which scroll past.
From: brad
2005-06-05 06:57 pm (UTC)
Heh, nice.
