DragonFly BSD
DragonFly submit List (threaded) for 2004-03
[Date Prev][Date Next]  [Thread Prev][Thread Next]  [Date Index][Thread Index]

Re: UTF8 locale MFC for DragonflyBSD [tjr@FreeBSD.org: cvs commit: src/etc/mtree BSD.usr.dist src/share/colldef Makefile src/share/mklocale Makefile UTF-8.src src/share/monetdef Makefile be_BY.UTF-8.src bg_BG.UTF-8.src cs_CZ.UTF-8.src en_GB.UTF-8.src...]


From: David Cuthbert <dacut@xxxxxxxxx>
Date: Sun, 28 Mar 2004 21:35:58 -0500

Xin LI wrote:
> Having utf8'ized locales in DragonFlyBSD
> will bring better internationalization, and personally, I believe that
> making utf-8 the default internal locale will make DragonFly the best
> UNIX-like platform to write internationalized applications.

Agreed wholeheartedly.


Out of curiosity, do you know how the FreeBSD folks handled some of the stranger UTF-8 behavior? In particular:

1. Representation of embedded NUL characters. The UTF-8 spec says this is one byte == 00000000b; Java and a few others, though, have used an overlong two-byte encoding (110/00000 + 10/000000, i.e. 0xC0 0x80) so that stuff like strlen() works "reasonably."

2. Maximum size of a character. To represent UCS-2 characters, you only needed up to 3 bytes (1110/xxxx + 10/xxxxxx + 10/xxxxxx). Unfortunately, characters outside the BMP need more: 4 bytes when the code point is encoded directly, or 6 bytes when each half of a surrogate pair is encoded as its own 3-byte sequence. Ironically, UTF-8 unaware routines handle these fine; some of my older UTF-8 handling routines, though, barf on anything at or above U+10000. (Fortunately, none of these have escaped Neolinear... I hope...)

3. Security issues. The UTF-8 and Unicode FAQ [1] states that "a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character," noting that "any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding."
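
To make points 1 and 3 concrete, here is a minimal sketch of a strict decoder for one UTF-8 sequence. It is not taken from the FreeBSD commit; the function name utf8_decode_strict and its return convention are assumptions made for illustration. It rejects overlong forms (including the 0xC0 0x80 encoding of NUL), encoded surrogates, and anything above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not FreeBSD's or DragonFly's actual code.
 * Decodes one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 */
size_t
utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {
        len = 1;
        c = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4;
        c = s[0] & 0x07;
    } else {
        return 0;   /* stray continuation byte, or a 5/6-byte lead byte */
    }

    if (n < len)
        return 0;   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (c < min_cp[len])
        return 0;   /* overlong: a shorter encoding exists */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;   /* UTF-16 surrogate encoded as UTF-8 */
    if (c > 0x10FFFF)
        return 0;   /* outside the Unicode range */

    *cp = c;
    return len;
}
```

With this check in place, a substring test for "/" cannot be bypassed by the overlong pair 0xC0 0xAF, since the decoder refuses it instead of returning U+002F. The cost is that Java-style strings containing 0xC0 0x80 are also rejected and would need explicit conversion first.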

None of these issues is a show-stopper. However, they are things we should check for and document.
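
As a starting point for checking point 2, the sketch below computes how many bytes a code point needs in standard UTF-8 and splits a code point at or above U+10000 into its UTF-16 surrogate pair. Both function names are hypothetical. If each surrogate is then encoded as its own 3-byte sequence (as CESU-8 and Java's modified UTF-8 do), the result is 3 + 3 = 6 bytes, whereas encoding the code point directly takes 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes needed to encode cp as
 * standard UTF-8, or 0 if cp is a surrogate or above U+10FFFF.
 */
size_t
utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* surrogates are not valid scalar values */
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}

/*
 * Hypothetical helper: split a code point in U+10000..U+10FFFF
 * into its UTF-16 high and low surrogates.
 */
void
utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    assert(hi != NULL && lo != NULL);
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* bottom 10 bits */
}
```

A routine that sizes its buffers for 3 bytes per character will pass every UCS-2 test and still fail on the first code point at or above U+10000, so test data should include at least one such character.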


[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html



