bytes or chars ?
jda
jda at his.com
Thu Sep 16 09:14:17 CDT 2004
>
>But, Erik, I see your argument and the example you give to actually
>support the case of using characters not bytes. I do agree that
>consistency is essential.
>
>The way I see your argument:
>
>myString = VString(100, "UTF16") // 100 bytes (50 chars) in UTF16
>
>If I change this to UTF-8:
>
>myString = VString(100, "UTF8") // 100 bytes (50-100 chars) in UTF8
>
>but I probably should have
>
>myString = VString(70, "UTF8") // 70 bytes (35-70 chars) in UTF8
Not true, Robert. UTF-8 characters can also be 3 or 4 bytes in length
(I recall someone posting that in rare cases 5 bytes are
required).
In any case, Erik is talking about changing the encoding after the
string was first created as UTF16. There is no chance for the
developer to change the string buffer at this point.
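
A small sketch of the point above, using Python's built-in codecs as a stand-in for whatever the database engine does internally (an assumption; nothing here is Valentina API). It prints the encoded size of a few sample characters and then shows a string that fills a 100-byte UTF-16 field exactly but no longer fits in 100 bytes once re-encoded as UTF-8. Note that Python's encoder follows RFC 3629, under which a UTF-8 sequence is at most 4 bytes; longer sequences existed only in older definitions.

```python
# Illustrative only: Python codecs stand in for the engine's encoders.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]  # Latin, accented, CJK, G clef

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # "-le" avoids counting a 2-byte BOM
    print(f"U+{ord(ch):04X}: {utf8_len} bytes in UTF-8, {utf16_len} bytes in UTF-16")

# A 50-character CJK string sized for a 100-byte UTF-16 buffer ...
name = "\u4e2d" * 50
utf16_bytes = len(name.encode("utf-16-le"))
# ... needs more room after the encoding is switched to UTF-8.
utf8_bytes = len(name.encode("utf-8"))
print(utf16_bytes, utf8_bytes)
```

So a buffer sized in bytes for one encoding says nothing reliable about the same text in another encoding.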
>
>to take into account that not all my chars need two bytes. But maybe
>it should be
>
>myString = VString(80, "UTF8") // 80 bytes (40-80 chars) in UTF8
>
>Whereas if the size is in chars
>
>myString = VString(50, "UTF16") // 50 chars (100 bytes) in UTF16
>myString = VString(50, "UTF8") // 50 chars (50-100 bytes) in UTF8
>
>Let's also consider some practical aspects of your other example,
>the uncertainty of the size of the lastname field.
>
>Lastname is the info that is commonly entered by users of my program
>in a dialog or through a web form. If the field sizes are in bytes,
>then I need to calculate the number of bytes (not chars) that user
>entered to ensure that it is not truncated. And how do I tell them
>that they can't enter more than 50 bytes when entering their name?
>Users don't think in bytes but characters.
That's an issue for your front-end (e.g. RB) to deal with, not the
database engine.
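
As a rough sketch of what such a front-end check could look like: the helper names `byte_length` and `fits_in_bytes` are hypothetical, and Python is used only for illustration (it is not RB or Valentina code). The front-end measures the encoded size of what the user typed and compares it with the byte size of the field before sending it to the database.

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def fits_in_bytes(text: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text fits a field of max_bytes bytes."""
    return byte_length(text, encoding) <= max_bytes


# Hypothetical 6-byte lastname field, UTF-8 encoded.
plain = "Muller"          # 6 characters, 6 bytes
accented = "M\u00fcller"  # 6 characters, 7 bytes (the u-umlaut takes 2)

for lastname in (plain, accented):
    ok = fits_in_bytes(lastname, 6)
    print(lastname, byte_length(lastname), "ok" if ok else "too long")
```

The user still only sees "too long"; the byte arithmetic stays inside the front-end.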
Look, this is really simple. You, the developer, decide on an
encoding. You know the character:byte ratio for that encoding: 1:1
(Roman), 1:2 (UTF-16), and 1:1-5 (UTF-8). You then decide how many
bytes to allocate.
BTW, just to make things a bit more confusing, UTF-16 is no longer
necessarily limited to 2 bytes per character. Again, in rare cases,
more bytes are needed. I don't know if the IBM ICU library supports
these languages.
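
To illustrate the UTF-16 remark: characters outside the Basic Multilingual Plane are stored as a surrogate pair, i.e. two 16-bit units (4 bytes). The sketch below uses Python's `utf-16-le` codec purely as a demonstration; it makes no claim about how ICU or Valentina handle such characters.

```python
bmp_char = "\u4e2d"         # CJK character inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Each UTF-16 code unit is 2 bytes; "-le" avoids counting a BOM.
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_units = len(astral_char.encode("utf-16-le")) // 2

print(f"U+{ord(bmp_char):04X}: {bmp_units} code unit(s)")
print(f"U+{ord(astral_char):04X}: {astral_units} code unit(s)")
```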
Last word: what we are arguing for is *internal consistency* in
Valentina. Having the string parameter mean different things
depending on the encoding is begging for trouble.
Jon
More information about the Valentina-beta
mailing list