bytes or chars ?

jda jda at his.com
Thu Sep 16 09:14:17 CDT 2004


>
>But, Erik, I see your argument and the example you give to actually 
>support the case of using characters not bytes. I do agree that 
>consistency is essential.
>
>The way I see your argument:
>
>myString = VString(100, "UTF16")  // 100 bytes (50 chars) in UTF16
>
>If I change this to UTF-8:
>
>myString = VString(100, "UTF8")  // 100 bytes (50-100 chars) in UTF8
>
>but I probably should have
>
>myString = VString(70, "UTF8")  // 70 bytes (35-70 chars) in UTF8

Not true, Robert. UTF-8 characters can also be 3 or 4 bytes in length 
(I recall someone posting that in rare cases 5 bytes are required).
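
To make the byte counts concrete, here is a minimal sketch in Python (used 
only for illustration; it is not part of Valentina). The helper name 
utf8_len is hypothetical. Note that Python follows the current UTF-8 
definition (RFC 3629), which caps a character at 4 bytes; the 5- and 6-byte 
forms belonged to the older definition.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "Latin A": "\u0041",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "musical G clef": "\U0001d11e",  # outside the Basic Multilingual Plane
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} byte(s) in UTF-8")
```

So a 100-byte UTF-8 buffer holds anywhere from 25 to 100 characters, not 
50 to 100.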

In any case, Erik is talking about changing the encoding after the 
string was first created as UTF16. There is no chance for the 
developer to change the string buffer at this point.
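
A rough illustration of why this matters, again in Python rather than 
Valentina: the same 50 characters that exactly fill a 100-byte UTF-16 
buffer can overflow it after a switch to UTF-8. The sample text is an 
assumption chosen to show the worst common case (CJK characters).

```python
# 50 copies of one CJK character (U+4E2D).
text = "\u4e2d" * 50

# "utf-16-le" is used so that no byte-order mark is added to the count.
utf16_size = len(text.encode("utf-16-le"))
utf8_size = len(text.encode("utf-8"))

print("UTF-16:", utf16_size, "bytes")
print("UTF-8: ", utf8_size, "bytes")
```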

>
>to take into account that not all my chars need two bytes. But maybe 
>it should be
>
>myString = VString(80, "UTF8")  // 80 bytes (40-80 chars) in UTF8
>
>Whereas if the size is in chars
>
>myString = VString(50, "UTF16")  // 50 chars (100 bytes) in UTF16
>myString = VString(50, "UTF8")  // 50 chars (50-100 bytes) in UTF8
>
>Let's consider also some practical aspects of your other example, 
>the uncertainty of size of lastname field.
>
>Lastname is the info that is commonly entered by users of my program 
>in a dialog or through a web form. If the field sizes are in bytes, 
>then I need to calculate the number of bytes (not chars) that user 
>entered to ensure that it is not truncated. And how do I tell them 
>that they can't enter more than 50 bytes when entering their name? 
>Users don't think in bytes but characters.

That's an issue for your front-end (e.g. RB) to deal with, not the 
database engine.
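
As a sketch of what such a front-end check might look like (in Python for 
illustration; an RB front-end would do the equivalent with its own string 
functions), the helpers below measure input in bytes before it reaches the 
database. The names byte_length, fits_field and truncate_utf8 are 
hypothetical, not Valentina or RB calls.

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Size of the text in bytes once encoded."""
    return len(text.encode(encoding))

def fits_field(text: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text fits in a field of max_bytes bytes."""
    return byte_length(text, encoding) <= max_bytes

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character.

    Slicing the bytes may leave an incomplete sequence at the end;
    decoding with errors="ignore" drops that partial character.
    """
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

name = "M\u00fcller"  # 6 characters, 7 bytes in UTF-8
print(byte_length(name), fits_field(name, 6), truncate_utf8(name, 2))
```

The user still sees a limit on what they typed; the byte arithmetic stays 
in the front-end.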

Look, this is really simple. You, the developer, decide on an 
encoding. You know the character:byte ratio for that encoding, 1:1 
(Roman), 1:2 (UTF-16), and 1:1-5 (UTF-8). You then decide how many 
bytes to allocate.
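
The sizing decision can be sketched as simple arithmetic. This is an 
illustration in Python, not Valentina code. The names are hypothetical, and 
the per-character maximums are assumptions taken from the current Unicode 
definitions (4 bytes for UTF-8 under RFC 3629, 4 bytes for a UTF-16 
surrogate pair), with latin-1 standing in for a single-byte encoding.

```python
# Worst-case bytes per character for each encoding (assumed values,
# see the note above).
MAX_BYTES_PER_CHAR = {
    "latin-1": 1,
    "utf-16": 4,  # 2 for most characters, 4 for a surrogate pair
    "utf-8": 4,   # 1 to 4
}

def worst_case_bytes(n_chars: int, encoding: str) -> int:
    """Bytes to allocate so that any n_chars characters will fit."""
    return n_chars * MAX_BYTES_PER_CHAR[encoding]

def guaranteed_chars(n_bytes: int, encoding: str) -> int:
    """Characters that are certain to fit in n_bytes, whatever they are."""
    return n_bytes // MAX_BYTES_PER_CHAR[encoding]

print(worst_case_bytes(50, "utf-8"))
print(guaranteed_chars(100, "utf-8"))
```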

BTW, just to make things a bit more confusing, UTF-16 is no longer 
necessarily limited to 2 bytes per character. Again, in rare cases, 
more bytes are required. I don't know if the IBM ICU library supports 
these languages.
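
For the curious, the UTF-16 case can be seen with a short Python sketch 
(illustration only; it says nothing about what ICU or Valentina do). 
Characters outside the Basic Multilingual Plane are stored as two 16-bit 
units, a surrogate pair, so they take 4 bytes. The helper name utf16_units 
is hypothetical.

```python
import struct

def utf16_units(ch: str) -> list[str]:
    """Return the 16-bit code units of ch in UTF-16, as hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    count = len(data) // 2
    return [f"{unit:04X}" for unit in struct.unpack(f">{count}H", data)]

print(utf16_units("\u20ac"))      # euro sign: one unit
print(utf16_units("\U0001d11e"))  # musical G clef: a surrogate pair
```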

Last word: what we are arguing for is *internal consistency* in 
Valentina. Having the string size parameter mean different things 
depending on the encoding is begging for trouble.

Jon


More information about the Valentina-beta mailing list