bytes or chars ?

Robert Brenstein rjb at robelko.com
Thu Sep 16 14:16:51 CDT 2004


>>I understand that UTF-8 is an issue because some chars use 1 but 
>>other 2 bytes which can offer disk space saving for some language. 
>>However, I still think that the field limits should be set in chars.
>
>But setting the maximum length of a UTF-8 encoded field in 
>characters simply won't work, for the very reason that you give: 
>some characters use 1 and others use 2 bytes.  It's already pretty 
>much a given that UTF-8 fields need to be described in bytes -- I 
>think the open question is whether fields stored in other encodings 
>are described in bytes or in characters.
>
>I believe Ruslan has said that we will at some point in the future 
>be able to modify a field's storage encoding after it's been 
>created.  It seems to me that this is a compelling argument for 
>sticking with bytes all the time; otherwise, what would happen in 
>the following case, with characters used for UTF16 and bytes for 
>UTF8:
>
>	myString = VString(50, "UTF16")  // 50 characters = 100 bytes in UTF16
>
>and some weeks later this is changed/added in code:
>
>	myString.StorageEncoding = "UTF8"
>
>The inconsistency of using bytes in one instance and characters in 
>another leads us into further uncertainty.  Obviously, Valentina 
>would need to interpret the "50" as either continuing to refer to 
>characters (but that wouldn't make sense, since character size is 
>unknown in UTF-8) or as now referring to bytes (but that wouldn't 
>make sense, since it would then have to change the field from taking 
>up 100 bytes to taking up 50 -- and developers would be upset that 
>Valentina was doing "sneaky" things behind their backs)....  And 
>then we'd face the same difficulty in situations where UTF-8 fields 
>were changed to UTF-16.
>
>All of this is avoided if we consistently discuss bytes.  As 
>developers, if we define a UTF-8 field as taking up 32 bytes (for a 
>string) or a maximum of 32 bytes (for a VarChar), we know it's going 
>to use 32 bytes, or a maximum of 32 bytes.  It's true that we won't 
>know precisely how many characters will fit in that field, but we're 
>used to that kind of uncertainty ("Let's see now:  how many 
>characters should I make this new LastName field?").  Uncertainty I 
>can cope with.  Inconsistency is much more difficult.
>
>-- Erik

But, Erik, I see your argument and the example you give as actually 
supporting the case for using characters, not bytes. I do agree that 
consistency is essential.

The way I see your argument:

myString = VString(100, "UTF16")  // 100 bytes (50 chars) in UTF16

If I change this to UTF-8:

myString = VString(100, "UTF8")  // 100 bytes (50-100 chars) in UTF8

but I probably should have

myString = VString(70, "UTF8")  // 70 bytes (35-70 chars) in UTF8

to take into account that not all my chars need two bytes. But maybe it should be

myString = VString(80, "UTF8")  // 80 bytes (40-80 chars) in UTF8

Whereas if the size is in chars:

myString = VString(50, "UTF16")  // 50 chars (100 bytes) in UTF16
myString = VString(50, "UTF8")  // 50 chars (50-100 bytes) in UTF8
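
To make the comparison above concrete, here is a minimal Python sketch (not Valentina code) that prints the character count next to the encoded size of a few sample names. The helper name sizes and the sample names are made up for illustration. Note that UTF-8 can in fact need up to 4 bytes per character, and UTF-16 needs 4 bytes for characters outside the Basic Multilingual Plane, so the "1 or 2 bytes" figures in this thread are a simplification.

```python
# Illustrative sketch only: compare character counts with encoded byte sizes.
# The function name and the sample names are hypothetical, not from Valentina.

def sizes(text):
    """Return (chars, utf8_bytes, utf16_bytes) for the given string."""
    return (
        len(text),                      # number of code points
        len(text.encode("utf-8")),      # 1 to 4 bytes per code point
        len(text.encode("utf-16-le")),  # "-le" avoids counting a 2-byte BOM
    )

for sample in ("Brenstein", "M\u00fcller", "\u0141ukasz"):
    chars, utf8_bytes, utf16_bytes = sizes(sample)
    print(f"{sample}: {chars} chars, {utf8_bytes} bytes UTF-8, "
          f"{utf16_bytes} bytes UTF-16")
```

A limit stated in characters stays the same across both encodings; a limit stated in bytes has to be re-estimated whenever the encoding changes.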

Let's also consider some practical aspects of your other example, the 
uncertainty about the size of a LastName field.

A last name is commonly entered by users of my program in a dialog 
or through a web form. If the field sizes are in bytes, then I need 
to calculate the number of bytes (not chars) that the user entered to 
ensure that it is not truncated. And how do I tell them that they 
can't enter more than 50 bytes when entering their name? 
Users think in characters, not bytes.
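
As a rough illustration of the extra work a byte-based limit pushes onto the application, here is a Python sketch of the check I would need on every input. The function names and the idea of a max_bytes field limit are assumptions for the example, not part of any Valentina API.

```python
# Illustrative sketch only: what input validation looks like when the field
# limit is in bytes. Names and limits are hypothetical, not Valentina calls.

def fits_in_field(text, max_bytes, encoding="utf-8"):
    """True if the encoded text fits a field limited to max_bytes."""
    return len(text.encode(encoding)) <= max_bytes

def truncate_to_bytes(text, max_bytes):
    """Cut text so its UTF-8 form fits max_bytes without splitting a char."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return clipped.decode("utf-8", errors="ignore")

name = "M\u00fcller"                 # 6 characters, 7 bytes in UTF-8
print(fits_in_field(name, 6))        # the 6-character name does not fit
print(truncate_to_bytes(name, 2))    # the cut cannot land inside the u-umlaut
```

With a limit in characters, the same check is just a comparison against len(name), and the limit shown to the user matches what they type.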

Well, a name field will likely be set to be large enough, but other 
input fields may have real restrictions. For example, my biggest 
application of Valentina is a content management system which has 
quite a few free-entry fields with upper limits on text. I will 
probably resort to using UTF16 to keep my life simple, but if I 
wanted to use UTF8 to save disk space, having to deal with bytes 
while my users think in characters would be a royal pain.

Robert


More information about the Valentina-beta mailing list