bytes or chars ?
Robert Brenstein
rjb at robelko.com
Thu Sep 16 11:59:13 CDT 2004
>on 16-09-2004 4:14, Erik Mueller-Harder at
>valentina-list at vermontsoftworks.com wrote:
>
>>> What we choose?
>>>
>>> bytes in all cases ?
>>> including UTF16 ?
>>
>> As a developer, I look for consistency: consistent data are good data;
>> consistent code is good code; consistent definitions are good
>> definitions.
>>
>> Now perhaps it's important that you maintain a consistent usage with
>> other DBMSs; that's not personally important to me -- as long as a tool
>> is internally consistent, I'm happy -- but you'll need to weigh that
>> yourselves.
>>
>> But if we can refer to "bytes" in all storage encodings and thereby be
>> clear and accurate and consistent throughout Valentina, isn't that
>> truly the best answer in the end?
>>
>
>Ditto :!)
>
>
>Cool Runnings,
>Erne.
I seem to have missed a lively discussion last night. I tried to
follow it, but the posts came in a somewhat random order, so I can't
figure out how it went. So, sorry if I am resurrecting something that
was already settled.
Being consistent with other DBMSs is not relevant, IMHO. There is no
1:1 conversion between databases, and each database has enough
peculiarities that programmers need to keep them apart anyway.
The bottom line seems to be a choice of whether the length
specification for string and varchar fields should be in chars or
bytes. Personally, I think it should be CHARS -- their count for a
given text string stays fixed regardless of the encoding used,
whereas the number of bytes changes depending on the encoding. The
kernel handles everything as UTF-16, so storage is always the number
of chars x 2, and the kernel can handle that easily by itself.
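That chars-x-2 arithmetic is easy to check (a Python sketch, purely for illustration; it is not Valentina code). Note the caveat: it holds for characters in the Basic Multilingual Plane, while rarer characters stored as surrogate pairs take 4 bytes in UTF-16.

```python
# Illustration only: for BMP characters, UTF-16 takes exactly
# 2 bytes per character ("utf-16-le" avoids counting the 2-byte BOM).
for text in ("hello", "привет"):
    chars = len(text)
    utf16_bytes = len(text.encode("utf-16-le"))
    print(text, chars, utf16_bytes)  # utf16_bytes == chars * 2
```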
I understand that UTF-8 is an issue because some chars use 1 byte and
others 2 (or more), which can offer disk-space savings for some
languages. However, I still think that the field limits should be set
in chars. If I deal with text, I can set/predict the size of the text
(char count), but I can't as easily predict the count of bytes. Why?
In English it is easy, almost always 1:1. In Russian it is
practically 2:1, and Japanese actually needs 3 bytes per character in
UTF-8. But in German or Polish, for example, it will depend on how
many special characters are in the specific text.
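The char-to-byte ratios above can be verified directly (a Python sketch, just to show the encoding arithmetic; the sample strings are mine):

```python
# UTF-8 byte counts vary by language: ASCII is 1 byte/char,
# Cyrillic is 2, Japanese kana/kanji are 3, and German or Polish
# text mixes 1- and 2-byte characters.
samples = ["hello", "привет", "日本語", "Größe"]
for text in samples:
    print(f"{text!r}: {len(text)} chars, "
          f"{len(text.encode('utf-8'))} UTF-8 bytes")
# hello: 5 chars, 5 bytes; привет: 6 chars, 12 bytes;
# 日本語: 3 chars, 9 bytes; Größe: 5 chars, 7 bytes.
```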
So, if the field size is in bytes, the burden of figuring out the max
bytes is squarely on us. If the field size is in chars, the burden is
on the kernel, and the kernel has it as hard or maybe even harder
than we do.
So maybe we should seek a different approach. Maybe string fields
specified as UTF-8 should really be stored as VarChar with a size of
2x the char count? This would hide the storage differences from us.
Or maybe string fields should not allow UTF-8 encoding at all and
thus eliminate the problem altogether.
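A minimal sketch of that 2x sizing idea (a hypothetical helper of mine, not a Valentina API): reserving 2 bytes per declared character hides UTF-8's variable width from the user, but note it only covers characters up to U+07FF; 3- and 4-byte UTF-8 sequences would still overflow the budget.

```python
def utf8_fits_2x_budget(text: str, char_limit: int) -> bool:
    """Hypothetical check: would this text fit a VarChar sized at
    2 bytes per declared character, with UTF-8 storage?"""
    if len(text) > char_limit:
        return False  # over the declared char limit
    return len(text.encode("utf-8")) <= char_limit * 2

print(utf8_fits_2x_budget("Größe", 10))  # True: 7 bytes <= 20
print(utf8_fits_2x_budget("日本語", 3))   # False: 9 bytes > 6
```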
Robert Brenstein
More information about the Valentina-beta mailing list