bytes or chars ?

Robert Brenstein rjb at robelko.com
Thu Sep 16 15:45:06 CDT 2004


>
>I think we need to be clear here; it seems to me we're discussing 
>two separate issues.  I agree with you, in fact, that it would be 
>great to always define field lengths in terms of characters -- what 
>could be clearer?  But as I understand it we *can't* do that:  there 
>*is* no way to specify an upper limit of *characters* for a UTF-8 
>field, and I don't see how there could be.  If you use UTF-8, you 
>(well, *we*, really, as developers) are going to have to address 
>those issues, as the parameters for UTF-8 are going to have to be in 
>bytes, and the users (if they're aware of such things at all) are 
>going to be thinking in characters.
>
>That's just the way things work with UTF-8; the issues, if I 
>understand correctly, are moot with UTF-16 and with single-byte 
>encodings, and we can avoid the uncertainty of UTF-8 by using some 
>other encoding.
>
>If, though, we can live with the uncertainty; if we're targeting 
>languages such as German and Polish or simply want to be open-minded 
>in terms of the characters we support; and if disk space is an 
>issue, UTF-8 is a good compromise.
>
>But what *I'm* arguing for is this:  since it's not possible to 
>describe the maximum length of UTF-8 fields in terms of characters 
>-- since we must describe them in bytes -- why not do the same for 
>other encodings?  Then we're consistent and clear.
>
>This has been a most enlightening discussion, by the way.  I am 
>always impressed by the professionalism and civility on the 
>Valentina lists, despite (or maybe because of?) the diversity of 
>platforms, development tools, previous experiences, languages, 
>cultural backgrounds, etc.
>
>-- Erik

Okay, it seems I missed that the discussion had come down to whether 
UTF-8 should have a different specification from the other encodings. 
Here we are in agreement. Whatever the decision is, field limits 
should use the same unit regardless of field type or encoding.

It seems, according to another post in this renewed thread, that even 
for UTF-16 the char-bytes issue is not clear, and that it is messier 
for UTF-8 than I thought (admittedly I haven't gotten into the UTF 
business much yet, but it is looming over me).
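
To make the char-bytes mismatch concrete, here is a small sketch in 
Python (not anything Valentina provides; the helper name byte_lengths 
and the sample strings are my own, chosen only for illustration). It 
counts code points and encoded bytes for the same text in UTF-8 and 
UTF-16. Note that "character" here means a Unicode code point, which 
is itself an assumption about what a user would count.

```python
# Illustrative only: compare character (code point) counts with byte counts.
# "utf-16-le" is used so that no byte-order mark is added to the result.

def byte_lengths(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )

if __name__ == "__main__":
    # German, Polish, Japanese, and one character outside the Basic
    # Multilingual Plane (it needs a surrogate pair in UTF-16).
    for sample in ["Gruss", "Grüße", "Łódź", "日本語", "𝄞"]:
        chars, utf8_bytes, utf16_bytes = byte_lengths(sample)
        print(sample, chars, utf8_bytes, utf16_bytes)
```

The last sample is the reason UTF-16 is not a clean "two bytes per 
character" case either: one code point takes four bytes there.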

My question was whether there is really no way to stay with 
characters. I understand that this would mean more work for Ruslan, 
but it sounds as though otherwise many of us will be redoing the same 
char-byte count recalculation over and over in our own environments.

If it must be bytes for UTF-8, then I resign myself to having bytes 
all over the place. BUT... then the kernel must be able to tell me how 
many bytes as well as how many chars are actually in each field. I 
also want built-in functions that let me make char-byte count 
conversions for my GUI. We seem to agree that, aside from storage, we 
will continue to deal with characters.
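
As a rough idea of the kind of helpers I mean, here is a sketch in 
Python. None of these functions exist in Valentina; the names 
char_and_byte_count, truncate_to_bytes and guaranteed_chars are 
hypothetical, and I am assuming that a "char" is a Unicode code point 
and that the field limit is given in bytes.

```python
# Hypothetical helpers for a GUI working against a byte-limited field.
# Use "utf-16-le" rather than "utf-16" if you try another encoding,
# otherwise Python adds a byte-order mark to every encoded piece.

def char_and_byte_count(text, encoding="utf-8"):
    """Return (code points, encoded bytes) for the current field value."""
    return len(text), len(text.encode(encoding))

def truncate_to_bytes(text, max_bytes, encoding="utf-8"):
    """Return the longest prefix of text that fits in max_bytes
    without cutting a character in half."""
    kept = []
    used = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > max_bytes:
            break
        kept.append(ch)
        used += size
    return "".join(kept)

def guaranteed_chars(max_bytes, worst_case_bytes_per_char=4):
    """Number of characters that always fit, whatever they are.
    UTF-8 needs at most 4 bytes per code point."""
    return max_bytes // worst_case_bytes_per_char

if __name__ == "__main__":
    print(char_and_byte_count("Łódź"))
    print(truncate_to_bytes("Grüße", 4))
    print(guaranteed_chars(20))
```

With something like this, the GUI could show the user a character 
count while still checking the byte limit before storing the value.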

Robert


More information about the Valentina-beta mailing list