[V4RB] Re: Unicode workaround via UTF-16 - problems
Ruslan Zasukhin
sunshine at public.kherson.ua
Fri Nov 7 11:09:09 CST 2003
on 11/7/03 10:23 AM, Dave Addey at dave.addey at dsl.pipex.com wrote:
> Hi Ruslan,
>
> Thanks for the response! I'm still having a few problems...
>
>> Then I think you need to use VarBinary or FixedBinary strings
>
> This sounds ideal! Thanks.
>
>>> In theory, byte-based sorting on a UTF-16 string (stored and referenced as
>>> bytes) is a valid sorting process.
>>
>> I think this is not correct, Dave.
>
> You're right. 2 bytes is not enough for *all* characters in the world. And
> characters which require more than 2 bytes would "break" my sorting (as
> UTF-16 uses 4 bytes to store them, with the first 2 bytes as an identifier).
>
> But, according to IBM...
>
> "All of the most common characters in use for all modern writing systems
> are already represented with 2 bytes. Characters in surrogate space take
> 4 bytes, but as a proportion of all world text they will always be very
> rare."
>
> So I should be pretty safe :-)
>
> This quote is from an excellent article I found at:
>
> http://www-106.ibm.com/developerworks/library/utfencodingforms/
>
>> I still think this will not work for SOME HARD languages.
>
> I agree. But see the quote above :-)
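
A minimal Python sketch of the surrogate case discussed above, assuming the strings are stored as big-endian UTF-16 bytes. The two sample characters are chosen only for illustration: a character encoded as a surrogate pair starts with a byte in the range 0xD8 to 0xDB, so in a plain byte comparison it sorts before 2-byte characters in the range U+E000 to U+FFFF, even though its code point is higher.

```python
# Byte-wise order of UTF-16 versus code point order.
# Assumption: strings are stored as UTF-16 big-endian without a BOM.

def utf16_byte_key(s):
    """Sort key that compares strings by their raw UTF-16BE bytes."""
    return s.encode("utf-16-be")

bmp_char = "\uff21"        # U+FF21, stored in 2 bytes: FF 21
astral_char = "\U00010400"  # U+10400, stored in 4 bytes: D8 01 DC 00

by_bytes = sorted([bmp_char, astral_char], key=utf16_byte_key)
by_code_point = sorted([bmp_char, astral_char])  # Python compares code points

# The two orders disagree for this pair.
print([hex(ord(c)) for c in by_bytes])
print([hex(ord(c)) for c in by_code_point])
```

For text that stays below U+D800 the two orders agree, which is why the effect shows up rarely.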
Dave,
I think the problem is not in the 4-byte chars.
As I have read, some accented chars (e.g. in German, a') can be expanded into 2
chars when we do sorting.
This means that your way of sorting can SOMETIMES give glitches.
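
A rough Python sketch of the kind of glitch meant here. It assumes UTF-16 big-endian storage and uses a simplified expansion table in the style of German phone-book ordering (a-umlaut treated as "ae", and so on). The table and the helper names are illustrative assumptions, not Valentina's actual collation.

```python
import unicodedata

# Assumption: a simplified German phone-book style expansion table.
# Real collation rules are more involved than this.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def utf16_byte_key(s):
    """Sort key that compares strings by their raw UTF-16BE bytes."""
    return s.encode("utf-16-be")

def expanded_key(s):
    """Sort key that lowercases and expands accented chars into 2 chars."""
    normalized = unicodedata.normalize("NFC", s).lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in normalized)

words = ["Zebra", "Äpfel", "Apfel"]

byte_sorted = sorted(words, key=utf16_byte_key)
expanded_sorted = sorted(words, key=expanded_key)

print(byte_sorted)      # byte order puts the accented word after "Zebra"
print(expanded_sorted)  # expansion keeps it next to "Apfel"
```

With byte sorting the accented word lands at the end of the list, while a reader using the expansion rule expects it next to "Apfel".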
--
Best regards,
Ruslan Zasukhin [ I feel the need...the need for speed ]
-------------------------------------------------------------
e-mail: ruslan at paradigmasoft.com
web: http://www.paradigmasoft.com
To subscribe to the Valentina mail list go to:
http://lists.macserve.net/mailman/listinfo/valentina
-------------------------------------------------------------