UTF8 storage (was Re: V4RB, Jon, project)

Wed Sep 15 09:47:24 CDT 2004

Hi, Ruslan & others --

I'm glad to see all the specifics re UTF8/16 storage detailed here.  
The inheritance model is great, Ruslan -- such a straightforward way to 
take care of general rules and the occasional exception.  Thank you for 
such an elegant solution.

I think I understand the complications of UTF8 storage on your end of 
things, and I think I see that you might be tempted to limit or even 
eliminate its support:

> At last of end, why we use unicode?
> To be able store any language.

This is quite true, of course.  And, as a U.S. developer who has been 
railing against the general lack of world-awareness of many U.S. 
developers for decades, I have been a long-standing supporter of 
Unicode and its universal adoption.

That said, I currently find myself in the ironic position of developing 
an "in-house" application which will have virtually no chance of being 
used outside the scope of countries using Latin-based alphabets, so 
UTF16 would be definite overkill -- essentially doubling the amount of 
space my database would require.
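
To put a rough number on that "doubling", here is a minimal Python 
sketch.  Python is used purely for illustration and has nothing to do 
with Valentina itself; the helper name storage_sizes and the sample 
string are made up, and the figures cover only the raw encoded bytes, 
not how any particular database lays out its records.

```python
def storage_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" omits the byte-order mark, so only the payload is counted.
    utf16_bytes = len(text.encode("utf-16-le"))
    return len(text), utf8_bytes, utf16_bytes

# Hypothetical sample: mostly ASCII, with three accented letters
# (e-acute, o-double-acute, o-slash) written as escapes.
sample = "caf\u00e9 in Gy\u0151r, then Troms\u00f8"
chars, utf8_bytes, utf16_bytes = storage_sizes(sample)
print(chars, utf8_bytes, utf16_bytes)
```

For text like this, each accented letter costs one extra byte in UTF8, 
while UTF16 spends two bytes on every character, accented or not.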

Why shouldn't I use Latin1 or MacRoman, then?  Because I recognize that 
the usable life of these character sets is (thankfully!) limited; 
because I want to be as cross-platform compatible as possible; and 
because it's clear to me that Unicode really should be used for 
everything.  But mostly, I don't want to limit users to the 128 or 256 
characters that I can be relatively certain will be supported 
more-or-less properly by an old-world character set:  I need and want 
to support the *occasional* use of Latin characters with diacritical 
marks -- including those we associate not just with French and Spanish 
(generally supported OK by MacRoman and Latin1), but also with 
Hungarian, Norwegian, and other Latin-based alphabets.  UTF8 is really 
perfect in such circumstances -- and it gives me even more, because 
essentially *all* writing systems are supported.
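
As a small illustration of the coverage gap, the Python sketch below 
checks whether a character can be represented in a given character set.  
The helper can_encode is a hypothetical name, and the codec names are 
the ones Python's standard library uses for Latin1, MacRoman, and UTF8; 
none of this reflects Valentina's own API.

```python
def can_encode(text, codec):
    """True if every character of text exists in the given character set."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

e_acute = "\u00e9"         # French/Spanish e with acute accent
o_double_acute = "\u0151"  # Hungarian o with double acute accent

for codec in ("latin-1", "mac_roman", "utf-8"):
    print(codec, can_encode(e_acute, codec), can_encode(o_double_acute, codec))
```

The Hungarian letter is the interesting case:  it has no slot in either 
of the single-byte sets, but UTF8 stores it without complaint.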

Yes, I do understand that my space usage in text fields will increase 
when I store certain characters, and I understand that using multi-byte 
characters in VarChar and String fields uses their available space more 
quickly than using single-byte characters.  In the case of VarChar, I'm 
not terribly concerned:  I define almost all VarChar fields as 504 
bytes even in situations where I expect the average field to be only 25 
or 30 characters, since that definition doesn't cost me anything and 
gains me flexibility.  If a half-dozen (or even *all*!) of those 
characters are multi-byte -- no difference, even in UTF8.  Not a 
problem!  All I have to do is remember to "round up" if I'm ever on the 
fence about how much space to allocate (something I'm likely to do 
anyway).
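
A quick Python sketch of that headroom, assuming the 504-byte limit is 
counted against the UTF8-encoded bytes (I have not confirmed how the 
engine counts); the function name fits_in_varchar and the constant are 
made up for illustration.

```python
VARCHAR_LIMIT_BYTES = 504  # the VarChar size mentioned above

def fits_in_varchar(text, limit=VARCHAR_LIMIT_BYTES):
    """True if the UTF-8 encoding of text fits within limit bytes.

    Assumption: the field limit applies to encoded bytes, not characters.
    """
    return len(text.encode("utf-8")) <= limit

# A 30-character value in which *every* character is a two-byte letter.
all_accented = "\u00e9" * 30
print(len(all_accented.encode("utf-8")), fits_in_varchar(all_accented))
```

Even in that all-accented case the value takes 60 bytes, a small 
fraction of the 504 available.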

I can see that using multi-byte characters in String fields defined 
with UTF8 storage could be problematic.  I tend to define 
very few String fields, though -- pretty much only for situations like 
product codes and so forth, where I can be certain of an exact length.  
These are situations where multi-byte characters are unlikely to be 
needed.  I'm supposing that my best plan is probably to keep these 
defined as UTF8, but to define the language as ASCII and validate input 
to ensure compliance.
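
The input check I have in mind could look something like the following 
Python sketch.  It is only an illustration:  is_valid_code is a 
hypothetical helper, the 7-character product code is an invented 
example, and str.isascii needs Python 3.7 or later.

```python
def is_valid_code(value, length):
    """True if value is pure ASCII and exactly `length` characters long.

    For pure ASCII text, characters and UTF-8 bytes are the same count,
    so a fixed-length field is filled exactly.
    """
    return value.isascii() and len(value) == length

print(is_valid_code("AB-1234", 7))        # plain ASCII, right length
print(is_valid_code("\u00c5B-1234", 7))   # leading A-ring is not ASCII
print(is_valid_code("AB-123", 7))         # too short
```

Only the first value passes; the other two would be rejected before 
they ever reach the database.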

So, in short, I hope that you continue to support storage in UTF8 for 
all string types.  In many cases, of course, UTF8 is not the best 
solution -- but there certainly are situations in which it makes a 
great deal of sense.

Thanks again for all you're doing.  Valentina 1.x has been a great 
product; 2.x is promising to be truly outstanding!

-- Erik