bytes or chars ?
Erik Mueller-Harder
valentina-list at vermontsoftworks.com
Thu Sep 16 09:09:02 CDT 2004
On Sep 16, 2004, at 08:16, Robert Brenstein wrote:
> Lastname is the info that is commonly entered by users of my program
> in a dialog or through a web form. If the field sizes are in bytes,
> then I need to calculate the number of bytes (not chars) that user
> entered to ensure that it is not truncated. And how do I tell them
> that they can't enter more than 50 bytes when entering their name?
> Users don't think in bytes but characters.
>
> Well, a name field will likely be set to be large enough but other
> input fields may have real restrictions. For example, my biggest
> application of Valentina is a content management system which has
> quite a few free-entry fields with upper limits on text.
I think we need to be clear here; it seems to me we're discussing two
separate issues. I agree with you, in fact, that it would be great to
always define field lengths in terms of characters -- what could be
clearer? But as I understand it we *can't* do that: there *is* no way
to specify an upper limit of *characters* for a UTF-8 field, and I
don't see how there could be. If you use UTF-8, you (well, *we*,
really, as developers) are going to have to address those issues, as
the parameters for UTF-8 are going to have to be in bytes, and the
users (if they're aware of such things at all) are going to be thinking
in characters.
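To make the byte/character gap concrete, here is a minimal sketch in Python (not Valentina code; the function names and the byte limits are made up for illustration). It assumes the field limit is given in bytes and shows how a developer might check user input against it, or cut it down without splitting a multi-byte character.

```python
def utf8_byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_field(text: str, max_bytes: int) -> bool:
    """True if the text fits in a field limited to max_bytes bytes."""
    return utf8_byte_length(text) <= max_bytes


def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without leaving a
    partial character at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # A cut in the middle of a multi-byte sequence leaves invalid
    # trailing bytes; errors="ignore" drops them when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


name = "Müller"
print(len(name), utf8_byte_length(name))  # characters vs. bytes
print(fits_field(name, 6))
print(truncate_to_bytes(name, 2))
```

The user sees six characters in "Müller", but the field has to hold seven bytes, which is exactly the mismatch we would have to explain (or hide) in the interface.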
That's just the way things work with UTF-8; the issues, if I understand
correctly, are moot with single-byte encodings and (as long as no
characters outside the Basic Multilingual Plane are involved) with
UTF-16, and we can avoid the uncertainty of UTF-8 by using some other
encoding.
If, though, we can live with the uncertainty; if we're targeting
languages such as German and Polish or simply want to be open-minded in
terms of the characters we support; and if disk space is an issue,
UTF-8 is a good compromise.
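As a rough illustration of that trade-off, the following Python sketch compares how many bytes the same name takes in a single-byte encoding, in UTF-8, and in UTF-16. The helper name and the choice of ISO 8859-2 as the single-byte encoding are my own assumptions for the example; they say nothing about which encodings Valentina actually offers.

```python
def encoded_sizes(text, encodings=("iso8859_2", "utf-8", "utf-16-le")):
    """Map each encoding name to the byte count of text in that encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


for name in ("Müller", "Wałęsa"):
    # Both names are six characters long.
    print(name, len(name), encoded_sizes(name))
```

For mostly-Latin text like this, UTF-8 stays close to the single-byte size and well under UTF-16, at the price of a byte count that varies with the characters entered.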
But what *I'm* arguing for is this: since it's not possible to
describe the maximum length of UTF-8 fields in terms of characters --
since we must describe them in bytes -- why not do the same for other
encodings? Then we're consistent and clear.
This has been a most enlightening discussion, by the way. I am always
impressed by the professionalism and civility on the Valentina lists,
despite (or maybe because of?) the diversity of platforms, development
tools, previous experiences, languages, cultural backgrounds, etc.
-- Erik