bytes or chars ?

jda jda at his.com
Thu Sep 16 17:13:51 CDT 2004


>>Not true, Robert. UTF-8 characters can also be 3 or 4 bytes in 
>>length (I recall someone posting that 5 bytes are required in 
>>certain rare cases).
>AFAIK (from the ICU User's Guide) the maximum possible length of a 
>character in UTF-8 is 4 bytes. Can you give an example where 5 bytes 
>are needed?
>

I did a Google search and found this tidbit on wikipedia.org:

----------
(An earlier UTF-8 specification allowed even higher code points to be 
represented, using 5 or 6 bytes, but this is no longer supported.)
----------

Also, on the unicode.org site 
(http://www.unicode.org/versions/corrigendum1.html) I found this:

----------
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also 
allows for the use of five- and six-byte sequences to encode 
characters that are outside the range of the Unicode character set; 
those five- and six-byte sequences are illegal for the use of UTF-8 
as a transformation of Unicode characters. ISO/IEC 10646 does not 
allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it 
does allow other noncharacters).
----------

So it apparently was possible to use more than 4 bytes for a UTF-8 
character (5 or 6 under the older definition), but those sequences are 
no longer supported and are illegal when UTF-8 is used to encode 
Unicode characters. Nothing to worry about, obviously.
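
For anyone who wants to check the byte counts themselves, here is a 
rough sketch in Python. The function names utf8_len and 
legacy_utf8_len are my own, not from any library, and the legacy 
ranges assume the older 5- and 6-byte scheme quoted above (code points 
up to 0x7FFFFFFF). Treat it as an illustration, not a reference 
implementation.

```python
def utf8_len(code_point: int) -> int:
    """Bytes needed for one Unicode code point under the current 4-byte limit."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def legacy_utf8_len(value: int) -> int:
    """Bytes needed under the older scheme (assumed here) that allowed 5 and 6 bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit range of the older scheme")
    # Upper bound (exclusive) of each length class, from 1 byte to 5 bytes.
    for length, limit in enumerate((0x80, 0x800, 0x10000, 0x200000, 0x4000000), start=1):
        if value < limit:
            return length
    return 6


# Cross-check the modern function against Python's own encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x10348, 0x10FFFF):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
```

Every value that needs 5 or 6 bytes under the older scheme lies above 
U+10FFFF, which is why no actual Unicode character ever needs more 
than 4.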

Jon


More information about the Valentina-beta mailing list