DragonFly BSD
DragonFly submit List (threaded) for 2004-03

Re: UTF8 locale MFC for DragonflyBSD


From: Dave Cuthbert <dacut@xxxxxxxxxxxxx>
Date: Mon, 29 Mar 2004 11:09:08 -0500

Joerg Sonnenberger wrote:
On Sun, Mar 28, 2004 at 09:35:58PM -0500, David Cuthbert wrote:
1. Representation of embedded NUL characters. The UTF-8 spec says this is one byte == 00000000b; Java and a few others, though, have used a double-byte encoding (110/00000 + 10/000000, i.e. the bytes 0xC0 0x80) so that stuff like strlen() works "reasonably."

This is a clear violation of the minimum-length encoding requirement. Since you should always use explicitly sized strings instead of delimited strings when processing strings with possible embedded NULs, I don't think we want this. Actually, this is one of the few checks we might want to do in the kernel if we go the UTF-8 road in the future. It would be pretty bad to be able to create undeletable / unviewable files :)

Agreed. Actually, I thought that this was a fairly elegant hack. Nonetheless, it is still a hack.
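
To make the trade-off concrete, here is a minimal sketch in Python (not anything from the tree). The helper names c_strlen, encode_modified and is_strict_utf8 are made up for illustration, and c_strlen only imitates what C's strlen() would report on the same bytes.

```python
def c_strlen(buf: bytes) -> int:
    """Imitate C strlen(): count the bytes before the first 0x00 byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


def encode_modified(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair 0xC0 0x80.

    In standard UTF-8 a 0x00 byte can only come from U+0000, so a plain
    byte replacement is enough for this sketch.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def is_strict_utf8(buf: bytes) -> bool:
    """Return True if buf decodes under Python's strict UTF-8 decoder."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "ab\x00cd"                  # five characters, one embedded NUL
standard = text.encode("utf-8")    # NUL stays a single 0x00 byte
modified = encode_modified(text)   # NUL becomes 0xC0 0x80

print(c_strlen(standard), c_strlen(modified))
print(is_strict_utf8(standard), is_strict_utf8(modified))
```

The standard form is truncated at the embedded NUL by a strlen()-style scan, while the two-byte form survives the scan but is rejected by a strict decoder, which is the kind of check being discussed for the kernel.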


I don't really want to force us to UCS2, just because MS did. It is pretty
pointless if you think about Unicode as a means to encode every _written_
script in the world. Therefore, if we want to apply any length checks, the
correct way is as specified by at least Unicode 3, i.e. UCS4.

Well, not just MS; a lot of folks (notably Sun/Java) were caught off guard when Unicode was extended beyond the base 64k characters. I won't replicate the flame wars here; they're all on Google. :-)


My personal opinion: UCS-4 wastes a lot of space given that Unicode 3.1 is a ~21-bit set and nobody is really using the >=U+10000 space in a practical manner (yet?). But if you need to have a one-to-one mapping, you don't have much choice.

Unless you have a machine which uses 21-bit bytes, of course. ;-)
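
For a rough feel of the space cost, here is a small Python sketch. The function name encoded_sizes and the sample strings are my own choices for illustration; Python's "utf-32-le" codec stands in for UCS-4, and the little-endian codecs are used so that no byte-order mark is counted.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space


def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in three encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


# Bits needed to hold any code point: this is the "~21-bit set".
print(MAX_CODE_POINT.bit_length())

# Plain ASCII: UCS-4 spends four bytes on every one-byte character.
print(encoded_sizes("hello"))

# One character at U+10400, outside the base 64k: all three need four
# bytes (UTF-16 uses a surrogate pair, i.e. two 16-bit units).
print(encoded_sizes("\U00010400"))
```

The fixed four bytes per character is what buys the one-to-one mapping; the variable-width encodings only catch up in size once text leaves the base 64k range.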


