bytes or chars?

Erik Mueller-Harder valentina-list at vermontsoftworks.com
Thu Sep 16 07:20:55 CDT 2004


On Sep 16, 2004, at 05:59, Robert Brenstein wrote:

> Personally, I think it should be CHARS -- their count for a given text 
> string stays fixed regardless of encoding used whereas the number of 
> bytes changes depending on the encoding. Kernel handles all as UTF-16, 
> so it is always the number of chars x 2 and kernel can handle this 
> easily self.

Well, the kernel is utterly transparent to us and is therefore, I 
think, irrelevant to the discussion.

> I understand that UTF-8 is an issue because some chars use 1 but other 
> 2 bytes which can offer disk space saving for some language. However, 
> I still think that the field limits should be set in chars.

But setting the maximum length of a UTF-8 encoded field in characters 
simply won't work, for the very reason that you give:  characters 
occupy a variable number of bytes in UTF-8 (1 to 4, in fact).  It's 
already pretty much a given that UTF-8 fields need to be described in 
bytes -- I think the open question is whether fields stored in other 
encodings are described in bytes or in characters.
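To make the variable-width point concrete, here's a quick sketch in 
plain Python (nothing Valentina-specific) showing that the same 
character count can mean very different byte counts:

```python
# Same strings, same character counts, different byte counts per encoding.
for s in ["cafe", "café", "日本語"]:
    print(f"{s!r}: {len(s)} chars, "
          f"{len(s.encode('utf-8'))} bytes in UTF-8, "
          f"{len(s.encode('utf-16-le'))} bytes in UTF-16")
```

So a "10-character" UTF-8 field could need anywhere from 10 to 40 bytes 
of storage, which is exactly why a character-based limit can't pin down 
the field's size on disk.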

I believe Ruslan has said that we will at some point in the future be 
able to modify a field's storage encoding after it's been created.  It 
seems to me that this is a compelling argument for sticking with bytes 
all the time; otherwise, what would happen in the following case, with 
characters used for UTF16 and bytes for UTF8:

	myString = VString(50, "UTF16")  // 50 characters = 100 bytes in UTF16

and some weeks later this is changed/added in code:

	myString.StorageEncoding = "UTF8"

Mixing bytes in one case and characters in another only leads us into 
further uncertainty.  Valentina would have to interpret the "50" either 
as still referring to characters (which wouldn't make sense, since 
character size is variable in UTF-8) or as now referring to bytes 
(which wouldn't make sense either, since the field would silently 
shrink from taking up 100 bytes to taking up 50 -- and developers 
would be upset that Valentina was doing "sneaky" things behind their 
backs).  And we'd face the same difficulty going the other direction, 
when a UTF-8 field is changed to UTF-16.
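The ambiguity is easy to demonstrate outside Valentina.  In this Python 
sketch (illustrative only; `VString` and `StorageEncoding` above are 
Valentina's, this is not), the "right" byte budget after the switch 
depends entirely on which characters the field happens to hold:

```python
# A 50-character value under both encodings.
ascii_val  = "a" * 50   # 50 one-byte UTF-8 characters
accent_val = "ü" * 50   # 50 two-byte UTF-8 characters

print(len(ascii_val.encode("utf-16-le")))   # 100 bytes in UTF-16
print(len(ascii_val.encode("utf-8")))       # 50 bytes in UTF-8
print(len(accent_val.encode("utf-16-le")))  # 100 bytes in UTF-16
print(len(accent_val.encode("utf-8")))      # 100 bytes in UTF-8
```

If "50" kept meaning characters after the switch to UTF-8, the field 
would need anywhere from 50 to 200 bytes; if it now meant bytes, 
existing 100-byte values might no longer fit.  Neither reading is safe.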

All of this is avoided if we consistently discuss bytes.  As 
developers, if we define a UTF-8 field as taking up 32 bytes (for a 
string) or a maximum of 32 bytes (for a VarChar), we know it's going to 
use 32 bytes, or a maximum of 32 bytes.  It's true that we won't know 
precisely how many characters will fit in that field, but we're used to 
that kind of uncertainty ("Let's see now:  how many characters should I 
make this new LastName field?").  Uncertainty I can cope with.  
Inconsistency is much more difficult.
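That "how many characters fit?" uncertainty is manageable in code, too. 
Here's a small sketch (my own helper, not anything Valentina provides) 
of fitting text into a fixed 32-byte UTF-8 budget:

```python
# Fit text into a fixed byte budget: the byte count is guaranteed,
# the character count varies with the text.
def utf8_truncate(text: str, max_bytes: int) -> str:
    data = text.encode("utf-8")[:max_bytes]
    # Drop any partial trailing character rather than keep corrupt bytes.
    return data.decode("utf-8", errors="ignore")

print(len(utf8_truncate("a" * 100, 32)))   # 32 ASCII characters fit
print(len(utf8_truncate("é" * 100, 32)))   # only 16 two-byte characters fit
```

Either way, the field never exceeds its 32 bytes -- which is the 
consistency the argument above is after.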

-- Erik
