bytes or chars?

Erik Mueller-Harder valentina-list at vermontsoftworks.com
Thu Sep 16 09:09:02 CDT 2004


On Sep 16, 2004, at 08:16, Robert Brenstein wrote:

> Lastname is the info that is commonly entered by users of my program 
> in a dialog or through a web form. If the field sizes are in bytes, 
> then I need to calculate the number of bytes (not chars) that user 
> entered to ensure that it is not truncated. And how do I tell them 
> that they can't enter more than 50 bytes when entering their name? 
> Users don't think in bytes but characters.
>
> Well, a name field will likely be set to be large enough but other 
> input fields may have real restrictions. For example, my biggest 
> application of Valentina is a content management system which has 
> quite a few free-entry fields with upper limits on text.

I think we need to be clear here; it seems to me we're discussing two 
separate issues.  I agree with you, in fact, that it would be great to 
always define field lengths in terms of characters -- what could be 
clearer?  But as I understand it we *can't* do that:  there *is* no way 
to specify an upper limit of *characters* for a UTF-8 field, and I 
don't see how there could be, since a single UTF-8 character can take 
anywhere from one to four bytes.  If you use UTF-8, you (well, *we*, 
really, as developers) are going to have to address those issues, as 
the parameters for UTF-8 are going to have to be in bytes, and the 
users (if they're aware of such things at all) are going to be thinking 
in characters.
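
To make the byte/character mismatch concrete, here is a minimal Python 
sketch of the check a developer would have to do before storing user 
input in a byte-limited UTF-8 field.  The function names and the limits 
used are made up for illustration; none of this is Valentina API.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_bytes(text: str, max_bytes: int) -> bool:
    """True if the UTF-8 encoding of text fits in a field of max_bytes."""
    return utf8_len(text) <= max_bytes


def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 form fits in max_bytes without splitting
    a multi-byte character in the middle."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops an incomplete trailing byte sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


name = "M\u00fcller"  # "Müller": 6 characters, but ü needs 2 bytes
print(len(name), utf8_len(name))  # 6 7
```

Note that the truncation works on code points only; it does not try to 
keep combining sequences together.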

That's just the way things work with UTF-8; the issues, if I understand 
correctly, are moot with single-byte encodings and (leaving aside 
surrogate pairs) with UTF-16, and we can avoid the uncertainty of 
UTF-8 by using some other encoding.

If, though, we can live with the uncertainty; if we're targeting 
languages such as German and Polish or simply want to be open-minded in 
terms of the characters we support; and if disk space is an issue, 
UTF-8 is a good compromise.
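
As a rough illustration of the disk-space trade-off, the following 
Python sketch compares encoded sizes for a few sample strings.  The 
samples are arbitrary, and "utf-16-le" is chosen only so that no byte 
order mark is counted.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le")):
    """Map each encoding name to the byte length of text in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


samples = [
    "Hello",                 # plain ASCII
    "Gr\u00fc\u00dfe",       # German "Grüße"
    "\u0141\u00f3d\u017a",   # Polish "Łódź"
    "\u65e5\u672c\u8a9e",    # Japanese "日本語"
]

for s in samples:
    # Print the character count, then the byte sizes per encoding.
    print(len(s), encoded_sizes(s))
```

For the German and Polish samples UTF-8 comes out smaller than UTF-16; 
for the Japanese sample it is larger, which is why the choice depends 
on the languages being targeted.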

But what *I'm* arguing for is this:  since it's not possible to 
describe the maximum length of UTF-8 fields in terms of characters -- 
since we must describe them in bytes -- why not do the same for other 
encodings?  Then we're consistent and clear.
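
If every field length were given in bytes, a conservative character 
limit to show to users could still be derived from the worst case for 
each encoding.  The sketch below is only an illustration in Python; the 
table and the function name are hypothetical, and the UTF-16 figure 
assumes characters outside the Basic Multilingual Plane (surrogate 
pairs) must be allowed for.

```python
# Worst-case bytes needed for one character (code point) per encoding.
MAX_BYTES_PER_CHAR = {
    "latin-1": 1,    # single-byte encoding
    "utf-16-le": 4,  # surrogate pair for characters outside the BMP
    "utf-8": 4,      # longest UTF-8 sequence
}


def guaranteed_chars(max_bytes: int, encoding: str) -> int:
    """Characters that always fit in max_bytes, whatever the user types."""
    return max_bytes // MAX_BYTES_PER_CHAR[encoding]


print(guaranteed_chars(50, "latin-1"))  # 50
print(guaranteed_chars(50, "utf-8"))    # 12
```

The number is pessimistic on purpose: most real names would fit well 
beyond it, but it is the only count that can be promised up front.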

This has been a most enlightening discussion, by the way.  I am always 
impressed by the professionalism and civility on the Valentina lists, 
despite (or maybe because of?) the diversity of platforms, development 
tools, previous experiences, languages, cultural backgrounds, etc.

-- Erik


