Unicode and length of field

john roberts jarobe01 at athena.louisville.edu
Mon May 12 12:52:03 CDT 2003


on 5/12/03 12:10 PM, Ruslan Zasukhin at sunshine at public.kherson.ua wrote:

> I just think aloud:

Me, too.
> 
> -- Assume we have a String[N] / VarChar[N] field.
> Up to now we have considered Len to be the maximum number of characters.
> 
> -- For a UTF8 String[N] / VarChar[N],
> N will play the role of length in bytes.
> The issue is that if we store chars >127 here, they can take 2 bytes.

Actually, some take more than 2 bytes.
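
To make that concrete, here is a minimal sketch in Python (my choice for illustration; nothing here is Valentina API). The helper name utf8_len is hypothetical. It simply counts the bytes one character occupies once encoded as UTF-8.

```python
# Hypothetical helper for illustration only; not part of Valentina.
def utf8_len(ch: str) -> int:
    """Return the number of bytes a string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "LATIN CAPITAL LETTER A": "A",           # U+0041
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} byte(s)")
```

A single character can need anywhere from 1 to 4 bytes in UTF-8, so doubling the character count is not a safe worst case either.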

> But some chars can still use one byte.
> We get a mess...

Only if we try to tie the number of characters and the number of bytes used
too tightly. We need to recognize that there is a "fuzzy" relationship, but
that is all.
> 
> The only way is to consider N as the length in bytes.

Absolutely. Let us adjust for the encodings as needed.

> Valentina also adds one byte to hold the END ZERO character.

OK; I think that is what you are currently doing, isn't it?

> 
> It is a bad idea to try to consider this as characters.
> Because in this way, Valentina should assume that if you say
> String[30], then it must allocate 60 bytes, to be able to store the worst case.
> But if you store MacRoman text, then you simply lose 30 bytes.
> Of course this is bad; the developer must be able to control each bit
> in the database.

Yes. Please do not do any automatic adjusting for us. Let us determine the
impact of a particular encoding that we may be using.
> 
> -- For UTF16, all chars always take 2 bytes.
> So for such a field, if we say String[30],
> Valentina is able to allocate (30 + 1) * 2 = 62 bytes,
> but then N again starts to mean characters...

I seem to remember Joe S. mentioning something about UTF16 using 2 bytes per
character "always except ..." and I don't recall the "..." part.
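
The best-known exception (possibly the one meant here, though I cannot say for sure) is that characters outside the Basic Multilingual Plane are stored in UTF-16 as a surrogate pair, that is, two 16-bit code units, or 4 bytes. A rough sketch in Python follows; the function names are hypothetical and the allocation formula is just the (N + 1) * 2 arithmetic quoted above, not a statement of what Valentina actually does.

```python
# Hypothetical helpers for illustration only; not part of Valentina.
def utf16_byte_len(text: str) -> int:
    """Bytes used by text in UTF-16.

    "utf-16-le" is used so that no 2-byte byte-order mark is prepended,
    which plain "utf-16" would add.
    """
    return len(text.encode("utf-16-le"))

def utf16_field_bytes(n_units: int) -> int:
    """(N + 1) * 2: N 16-bit code units plus a 2-byte END ZERO."""
    return (n_units + 1) * 2

bmp_char = "A"               # inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_byte_len(bmp_char))          # one 16-bit code unit
print(utf16_byte_len(astral_char))       # a surrogate pair
print(utf16_field_bytes(30))             # allocation for String[30]
print(utf16_byte_len(astral_char * 30))  # 30 such characters need more than that
```

So even with UTF16, a count of N does not strictly mean N characters once surrogate pairs are involved.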
> 
> 
> So we have 2 ways:
> 
> 1) Consider everything as I have described above. In this case UTF8 is an
> EXCEPTION from the rule.

UTF16 is more consistent but still has its problems.
> 
> 2) To make things more consistent, we can require the length for UTF16 also in
> bytes. Then if you need 30 chars you say String[60], and Valentina itself adds 2
> bytes for END ZERO.
> In this way both Unicode encodings, UTF8 and UTF16, play by the same rules, which
> differ from the old Strings.

As a user/developer, I would prefer that String[n] refer to n bytes
regardless of the encoding, and that Valentina let us determine how many
bytes we need to do the job.
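
As a sketch of what "determining how many bytes we need" could look like on the developer's side, here is a small Python example. The names bytes_needed and fits_in_field are hypothetical, and it assumes n counts payload bytes only, with the END ZERO terminator added separately by the database as described above.

```python
# Hypothetical helpers for illustration only; not part of Valentina.
def bytes_needed(text: str, encoding: str) -> int:
    """Payload bytes for text in the given encoding (terminator not counted).

    Raises UnicodeEncodeError if the encoding cannot represent the text.
    """
    return len(text.encode(encoding))

def fits_in_field(text: str, n_bytes: int, encoding: str) -> bool:
    """True if text fits in a String[n_bytes] field, with n taken as bytes."""
    return bytes_needed(text, encoding) <= n_bytes

city = "Z\u00fcrich"  # six characters, one of them above code point 127

for enc in ("mac_roman", "utf-8", "utf-16-le"):
    print(enc, bytes_needed(city, enc), fits_in_field(city, 6, enc))
```

The same six characters need a different number of bytes in each encoding, which is why I would rather size the field myself.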

Thank you,
John Roberts


