Unicode and length of field
john roberts
jarobe01 at athena.louisville.edu
Mon May 12 12:52:03 CDT 2003
on 5/12/03 12:10 PM, Ruslan Zasukhin at sunshine at public.kherson.ua wrote:
> I am just thinking aloud:
Me, too.
>
> -- assume we have a String[N] / VarChar[N] field.
> Up to now we have considered Len as the maximum number of characters.
>
> -- for UTF8 String[N] / VarChar[N]
> N will play the role of length in bytes.
> The deal is that if we store chars >127 here, they can take 2 bytes.
Actually, some take more than 2 bytes.
> but some chars still can use one byte.
> We get mess...
Only if we try to tie the number of characters and the bytes used too tightly. We
need to recognize that there is a "fuzzy" relationship, but that is all.
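
To make the "fuzzy" relationship concrete, here is a minimal Python sketch (my own illustration, not anything Valentina does; the helper name utf8_len is made up) that counts the bytes a few characters occupy in UTF-8:

```python
# Illustration only: how many bytes single characters occupy in UTF-8.
# utf8_len is a hypothetical helper name, not part of any Valentina API.

def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, a character outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

The same number of characters can therefore need anywhere from one to four bytes each, which is why N in bytes and N in characters cannot be tied together exactly.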
>
> the only way -- consider N as length in bytes.
Absolutely. Let us adjust for the encodings as needed.
> Valentina also adds one byte to keep the END ZERO character.
OK; I think that is what you are currently doing, isn't it?
>
> It is a bad idea to try to treat N as characters here.
> Because in that case Valentina would have to assume that if you say
> String[30], it must allocate 60 bytes to be able to store the worst case.
> But if you store MacRoman text, you simply lose 30 bytes.
> Of course this is bad; the developer must be able to control each bit
> in the database.
Yes. Please do not do any automatic adjusting for us. Let us determine the
impact of the particular encoding that we may be using.
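
The arithmetic in the String[30] example above can be sketched in Python as follows. This is only an illustration under the assumptions in the quoted text (2 bytes reserved per character, single-byte MacRoman data); wasted_bytes is a hypothetical helper, not a Valentina function:

```python
# Illustration only: bytes left unused when a field reserves a fixed
# number of bytes per character but the stored text needs fewer.
# wasted_bytes is a hypothetical helper, not part of Valentina.

def wasted_bytes(text: str, encoding: str, n_chars: int, reserved_per_char: int) -> int:
    """Return reserved bytes minus the bytes `text` actually uses in `encoding`."""
    reserved = n_chars * reserved_per_char
    used = len(text.encode(encoding))
    return reserved - used

text = "x" * 30  # 30 characters that MacRoman stores in one byte each

# 60 bytes reserved (30 chars * 2), 30 bytes used
print(wasted_bytes(text, "mac_roman", n_chars=30, reserved_per_char=2))
```

With these assumptions the field loses 30 of its 60 bytes, which is the loss described in the quoted paragraph.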
>
> -- for UTF16, all chars always take 2 bytes.
> So for such a field, if we say String[30],
> Valentina is able to allocate (30 + 1) * 2 = 62 bytes,
> but then N again starts to mean characters...
I seem to remember Joe S. mentioning something about UTF16 using 2 bytes per
character "always except ..." and I don't recall the "..." part.
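
For what it is worth, the usual exception to "2 bytes always" in UTF-16 is characters outside the Basic Multilingual Plane, which are encoded as surrogate pairs and take 4 bytes. Whether that is what Joe S. had in mind is my assumption. A small Python sketch (utf16_len is a made-up helper name) shows the effect on the (30 + 1) * 2 = 62 byte allocation:

```python
# Illustration only: UTF-16 uses 2 bytes for most characters but 4 bytes
# (a surrogate pair) for characters above U+FFFF.
# utf16_len is a hypothetical helper name.

def utf16_len(text: str) -> int:
    """Return the number of bytes `text` occupies in UTF-16.

    "utf-16-le" is used so that no byte order mark is added to the count.
    """
    return len(text.encode("utf-16-le"))

print(utf16_len("A"))           # a Basic Multilingual Plane character
print(utf16_len("\u20ac"))      # euro sign, also in the BMP
print(utf16_len("\U0001F600"))  # outside the BMP: surrogate pair

# The allocation from the quoted text: 30 characters plus END ZERO.
field_bytes = (30 + 1) * 2

# Thirty characters outside the BMP, plus a 2-byte terminator.
needed = utf16_len("\U0001F600" * 30) + 2
print(field_bytes, needed, needed <= field_bytes)
```

So even with UTF-16, 30 characters are not guaranteed to fit in 62 bytes.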
>
>
> So we have 2 ways:
>
> 1) consider everything as I have described above. In this case UTF8 is an
> EXCEPTION from the rule.
UTF16 is more consistent but still has its problems.
>
> 2) To make things more consistent, we can require the length for UTF16 also in
> bytes. Then if you need 30 chars you say String[60], and Valentina itself adds 2
> bytes for END ZERO.
> In this way both Unicode encodings, UTF8 and UTF16, play by the same rules, which
> differ from the old Strings.
As a user/developer I would prefer that String[N] refer to N bytes
regardless of the encoding, and let us determine how many bytes we need to do
the job.
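
As a rough sketch of what "let us determine how many bytes we need" could look like on the developer's side, here is a hypothetical Python helper (fit_to_bytes is my own name, not a Valentina call) that keeps as many whole characters as fit into an N-byte field, whatever the encoding:

```python
# Illustration only: trim text so that its encoded form fits in n_bytes
# without cutting a character in half.
# fit_to_bytes is a hypothetical helper, not part of Valentina.

def fit_to_bytes(text: str, n_bytes: int, encoding: str = "utf-8") -> str:
    """Return the longest prefix of `text` whose encoding fits in `n_bytes`.

    Use an encoding name that adds no byte order mark (for example
    "utf-16-le" rather than "utf-16"), since each character is measured
    on its own.
    """
    used = 0
    kept = []
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > n_bytes:
            break  # the next character would overflow the field
        kept.append(ch)
        used += size
    return "".join(kept)

word = "caf\u00e9s"  # the accented letter takes 2 bytes in UTF-8
print(fit_to_bytes(word, 4))               # accented letter does not fit
print(fit_to_bytes(word, 5))               # accented letter fits, final "s" does not
print(fit_to_bytes(word, 5, "mac_roman"))  # one byte per character
```

The same String[5] field holds a different number of characters depending on the encoding, and the developer decides how to handle that.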
Thank you,
John Roberts