bytes or chars ?
jda
jda at his.com
Thu Sep 16 17:13:51 CDT 2004
>>Not true, Robert. UTF-8 characters can also be 3 or 4 bytes in
>>length (I recall someone posting that 5 bytes are required in
>>certain rare cases).
>AFAIK (from the ICU User's Guide) the maximum possible length of a
>character in UTF-8 is 4 bytes. Can you give an example when 5 bytes
>are needed?
>
I did a Google search and found this tidbit on Wikipedia:
----------
(An earlier UTF-8 specification allowed even higher code points to be
represented, using 5 or 6 bytes, but this is no longer supported.)
----------
Also, from the unicode.org site
(http://www.unicode.org/versions/corrigendum1.html) I found this:
----------
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character set;
those five- and six-byte sequences are illegal for the use of UTF-8
as a transformation of Unicode characters. ISO/IEC 10646 does not
allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it
does allow other noncharacters).
----------
So I guess it is/was a possibility to use more than 4 bytes for a
UTF-8 character, but it is either not supported or not used (or
both). Nothing to worry about, obviously.
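For anyone who wants to check the byte counts themselves, here is a rough Python sketch. The function name utf8_length and its legacy flag are made up for illustration and are not part of any library. The ranges for the 5- and 6-byte forms follow the older 31-bit UTF-8 scheme as I understand it, so treat those two rows as an assumption rather than a reference.

```python
def utf8_length(code_point: int, legacy: bool = False) -> int:
    """Return how many bytes UTF-8 needs for one code point.

    utf8_length and legacy are hypothetical names for this sketch.
    legacy=True applies the older 31-bit scheme that allowed 5- and
    6-byte sequences; the default stops at U+10FFFF (4 bytes).
    Surrogates (U+D800..U+DFFF) are not rejected here, although real
    encoders refuse them.
    """
    if legacy:
        # Assumed upper bounds of the older, wider scheme.
        limits = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3),
                  (0x1FFFFF, 4), (0x3FFFFFF, 5), (0x7FFFFFFF, 6)]
    else:
        # Current Unicode range: nothing above U+10FFFF.
        limits = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3), (0x10FFFF, 4)]
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for upper, length in limits:
        if code_point <= upper:
            return length
    raise ValueError("code point out of range")


# Cross-check against Python's own encoder for 1- to 4-byte characters.
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    encoded = ch.encode("utf-8")
    assert utf8_length(ord(ch)) == len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
```

With the default setting, anything above U+10FFFF raises an error, which matches the "no longer supported" wording quoted above. Passing legacy=True is the only way the sketch ever reports 5 or 6 bytes.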
Jon
More information about the Valentina-beta
mailing list