bytes or chars?

Robert Brenstein rjb at robelko.com
Thu Sep 16 11:59:13 CDT 2004


>on 16-09-2004 4:14, Erik Mueller-Harder at
>valentina-list at vermontsoftworks.com wrote:
>
>>>  What do we choose?
>>>
>>>  bytes in all cases?
>>>  including UTF-16?
>>
>>  As a developer, I look for consistency:  consistent data are good data;
>>  consistent code is good code; consistent definitions are good
>>  definitions.
>>
>>  Now perhaps it's important that you maintain a consistent usage with
>>  other DBMSs; that's not personally important to me -- as long as a tool
>>  is internally consistent, I'm happy -- but you'll need to weigh that
>>  yourselves.
>>
>>  But if we can refer to "bytes" in all storage encodings and thereby be
>>  clear and accurate and consistent throughout Valentina, isn't that
>>  truly the best answer in the end?
>>
>
>Ditto :!)
>
>
>Cool Runnings,
>Erne.

I seem to have missed a lively discussion last night. I tried to 
follow it, but the posts came in a somewhat random order, so I 
can't figure out how it went. So, my apologies if I am 
resurrecting something that was already settled.

Being consistent with other DBMSs is not relevant, IMHO. There is 
no 1:1 conversion between databases, and each database has enough 
peculiarities that programmers need to keep them apart anyway.

The bottom line seems to be a choice of whether the length 
specification for String and VarChar fields should be in chars or 
bytes. Personally, I think it should be CHARS -- the char count 
for a given text string stays fixed regardless of the encoding 
used, whereas the number of bytes changes depending on the 
encoding. The kernel handles everything as UTF-16 internally, so 
the byte count is always the number of chars x 2, and the kernel 
can do that conversion easily itself.
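
To make the arithmetic concrete, here is a quick Python sketch 
(an illustration only, not Valentina API):

  text = "Grüße"                          # 5 chars
  chars = len(text)                       # 5
  utf16 = len(text.encode("utf-16-le"))   # 10 = chars * 2
  assert utf16 == chars * 2
  # Caveat: chars outside the Basic Multilingual Plane take 4
  # bytes in UTF-16 (surrogate pairs), but those are rare.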

I understand that UTF-8 is an issue because some chars take 1 
byte but others take 2 or more, which can offer disk space 
savings for some languages. However, I still think that the field 
limits should be set in chars.

If I deal with text, I can set/predict the size of the text (char 
count), but I can't as easily predict the count of bytes. Why? In 
English it is easy: almost always 1:1. In Russian it is 
practically 2:1, and in Japanese closer to 3:1. But in German or 
Polish, for example, it will depend on how many special 
characters are in the specific text.
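
For example, running a few sample strings through Python's codecs 
shows how the chars-to-bytes ratio shifts per language under 
UTF-8 (again just a sketch, nothing Valentina-specific):

  for text in ("text", "текст", "テキスト", "Grüße"):
      print(len(text), len(text.encode("utf-8")))
  # "text"     -> 4 chars,  4 bytes  (English, 1:1)
  # "текст"    -> 5 chars, 10 bytes  (Russian, 2:1)
  # "テキスト" -> 4 chars, 12 bytes  (Japanese, 3:1)
  # "Grüße"    -> 5 chars,  7 bytes  (German, varies with the text)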

So, if the field size is in bytes, the burden of figuring out the 
max bytes is squarely on us. If the field size is in chars, the 
burden is on the kernel, and the kernel has it as hard as we do, 
or maybe even harder.

So maybe we should seek a different approach. Maybe string fields 
specified as UTF-8 should really be stored as VarChar with a size 
of 2x the char count? That would hide the storage differences 
from us. Or maybe String fields should not allow UTF-8 encoding 
at all, and thus eliminate the problem altogether.
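
In other words, an allocation rule along these lines (a 
hypothetical sketch of the proposal in Python -- the names are 
mine, not Valentina's):

  def utf8_storage_bytes(char_limit):
      # Proposed rule: a field declared as N chars reserves N * 2
      # bytes, same as it would occupy under UTF-16.
      return char_limit * 2

  def fits(text, char_limit):
      # A value must respect both the declared char limit and the
      # byte budget implied by the rule above.
      return (len(text) <= char_limit and
              len(text.encode("utf-8")) <= utf8_storage_bytes(char_limit))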

Robert Brenstein

