Unicode::CheckUTF8 - brad's life
Brad Fitzpatrick


Unicode::CheckUTF8 [Dec. 8th, 2005|12:40 am]

Here's my latest contribution to the CPAN: Unicode::CheckUTF8.

It was originally Inline.pm code that haunted us for years (Inline deployment sucks) so Artur helped me convert it to XS. It's pretty pathetic I even had to do this, but nothing better's come out in the meantime.

All the other options are either incorrect or incorrect and segfault.

So Unicode::CheckUTF8 is:

-- correct (see test suite!)
-- fast (written in C)
-- doesn't use regexp engine (see fast, but also so it doesn't segfault)

Please, correct me if I'm wrong and something else works, but run your answer through my test suite first. Things known to misbehave, and the reasons if I'm aligning them properly:

-- W3C's recommended regexp segfaults perl with ease
-- Encode, Unicode::String -- don't reject low ASCII bytes that expat/mozilla reject
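
A rough illustration of the two checks being discussed here (well-formed UTF-8, plus rejecting low control bytes), written as a Python sketch rather than the module's actual C code. The function name `is_strict_utf8` is hypothetical, and the control-character rule is an assumption modelled on the XML 1.0 character range (tab, LF and CR allowed, everything else below 0x20 rejected), which is the kind of rule an XML parser enforces.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 with no low control bytes.

    Assumption: "low ASCII bytes" means code points below 0x20 other than
    tab, line feed and carriage return (the XML 1.0 rule).
    """
    try:
        # Python's strict decoder rejects truncated sequences, overlong
        # encodings, encoded surrogates and values above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False

    allowed_controls = {"\t", "\n", "\r"}
    # Reject any remaining character below U+0020 that is not whitespace.
    return all(ch >= " " or ch in allowed_controls for ch in text)


if __name__ == "__main__":
    samples = [b"bj\xc3\xb6rk", b"a\tb\n", b"abc\x01", b"\xc0\xaf", b"\xed\xa0\x80"]
    for sample in samples:
        print(sample, is_strict_utf8(sample))
```

Note that no regexp is involved: the check is a single decode pass followed by a linear scan.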

*yawn* Well, I guess I can be excited about getting rid of Inline from production.

Comments:
From: bpavlak
2005-12-08 08:52 am (UTC)
Isn't it past your bed time?

From: pos_le_terrible
2005-12-08 10:26 am (UTC)
Couldn't this solve your problem?
http://search.cpan.org/~jgmyers/Encode-Detect-0.01/Detector.pm

It uses Mozilla's detection mechanism, so it should be quite correct and efficient (I did not test it).

But detecting charset encoding is one thing, and converting it to proper utf8 is another (and this module uses the standard Encode module for that...)

From: brad
2005-12-08 06:54 pm (UTC)
I don't want to detect what encoding it is. I'm assuming it's UTF-8, because that's all we accept, then I want to ask: Is it perfect UTF-8 with no flaws?

From: pos_le_terrible
2005-12-08 07:13 pm (UTC)
OK, I thought it would only detect perfect UTF-8 and reject any false UTF-8, but again I didn't test it, so it might also just accept UTF-8 as the "most probable" encoding.

But you say that you want to detect bad UTF-8 that would be rejected by Expat or Mozilla, so if Mozilla detects it as UTF-8, will it reject it afterward (since this code is based on Mozilla)?

From: brad
2005-12-08 08:25 pm (UTC)
I couldn't parse your second paragraph. In any case, encoding guessing is not what I'm after.

From: gaal
2005-12-08 11:31 am (UTC)
Can you say what makes Inline suck at deployment? I can imagine (a) its lazy approach to building and (b) how binary objects are stored in something that's essentially a cache both have to do with it. I'm wondering if there can't be a best-of-both-worlds solution though, because truly it is sad to hear you preferring the pain of XS over Inline.

From: evan
2005-12-08 04:20 pm (UTC)
Do you do any Unicode normalization on LJ?
(For example, I could imagine people using "b j o-umlaut r k" and "b j o combining-umlaut r k" as interests.)

From: gaal
2005-12-08 05:48 pm (UTC)
Many'd just say "b j & o a c u t e ; r k".

(Well. Most'd just say "b j o r k" :-)

From: gaal
2005-12-08 05:49 pm (UTC)
s/acute/umlaut/

From: brad
2005-12-08 06:55 pm (UTC)
None.

From: mart
2005-12-08 10:46 pm (UTC)

Unless something's changed very recently, LiveJournal can't even convert non-ASCII characters to lowercase! ;)


From: taral
2005-12-08 06:00 pm (UTC)
Perl segfaults on a regexp? That would count as a Bad Thing (tm).