UTF8 storage (was Re: V4RB, Jon, project)
Erik Mueller-Harder
valentina-list at vermontsoftworks.com
Wed Sep 15 09:47:24 CDT 2004
Hi, Ruslan & others --
I'm glad to see all the specifics re UTF8/16 storage detailed here.
The inheritance model is great, Ruslan -- such a straightforward way to
take care of general rules and the occasional exception. Thank you for
such an elegant solution.
I think I understand the complications of UTF8 storage on your end of
things, and I think I see that you might be tempted to limit or even
eliminate its support:
> At last of end, why we use unicode?
> To be able store any language.
This is quite true, of course. And, as a U.S. developer who has been
railing against the general lack of world-awareness of many U.S.
developers for decades, I have been a long-standing supporter of
Unicode and its universal adoption.
That said, I currently find myself in the ironic position of developing
an "in-house" application which will have virtually no chance of being
used outside the scope of countries using Latin-based alphabets, so
UTF16 would be definite overkill -- essentially doubling the amount of
space my database would require.
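
To make the "doubling" concrete, here is a minimal sketch in plain Python (not Valentina's API; the helper name `encoded_sizes` and the sample strings are made up for illustration) that compares the byte count of the same text in UTF-8 and UTF-16:

```python
# Compare how many bytes the same text needs in UTF-8 and in UTF-16.
# This only measures the encoded text, not any database overhead.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string."""
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" is used so the 2-byte byte-order mark is not counted.
    utf16_bytes = len(text.encode("utf-16-le"))
    return utf8_bytes, utf16_bytes

plain = "Montpelier, Vermont"   # pure ASCII
accented = "Erdős"              # one character outside ASCII

for sample in (plain, accented):
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 = {utf8_bytes} bytes, UTF-16 = {utf16_bytes} bytes")
```

For pure ASCII text the UTF-16 figure is exactly twice the UTF-8 figure; an occasional accented letter narrows the gap only slightly.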
Why shouldn't I use Latin1 or MacRoman, then? Because I recognize that
the usable life of these character sets is (thankfully!) limited;
because I want to be as cross-platform compatible as possible; and
because it's clear to me that Unicode really should be used for
everything. But mostly, I don't want to limit users to the 128 or 256
characters that I can be relatively certain will be supported
more-or-less properly by an old-world character set: I need and want
to support the *occasional* use of Latin characters with diacritical
marks -- including those we associate not just with French and Spanish
(generally supported OK by MacRoman and Latin1), but also with
Hungarian, Norwegian, and other Latin-based alphabets. UTF8 is really
perfect in such circumstances -- and it gains me even more, because
essentially *all* writing systems are supported.
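
The limitation of the older character sets can be checked directly. The sketch below uses Python's standard codecs (`latin-1`, `mac_roman`, `utf-8`); the helper `can_encode` and the sample words are hypothetical, chosen only to show a Hungarian letter that the single-byte sets lack:

```python
# Check which character sets can represent a given string at all.

def can_encode(text, codec):
    """Return True if every character in text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

samples = ["café", "Erdős"]  # French-style accent vs. Hungarian double acute
for word in samples:
    for codec in ("latin-1", "mac_roman", "utf-8"):
        print(f"{word!r} in {codec}: {can_encode(word, codec)}")
```

The Hungarian "ő" has no slot in either Latin1 or MacRoman, while UTF-8 accepts both words.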
Yes, I do understand that my space usage in text fields will increase
when I store certain characters, and I understand that using multi-byte
characters in VarChar and String fields uses their available space more
quickly than using single-byte characters. In the case of VarChar, I'm
not terribly concerned: I define almost all VarChar fields as 504
bytes even in situations where I expect the average field to be only 25
or 30 characters, since that definition doesn't cost me anything and
gains me flexibility. If a half-dozen (or even *all*!) of those
characters are multi-byte -- no difference, even in UTF8. Not a
problem! All I have to do is remember to "round up" if I'm ever on the
fence about how much space to allocate (something I'm likely to do
anyway).
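
The "round up" reasoning can be sketched as simple arithmetic. This assumes, as the paragraph above does, that the field limit is counted in bytes of UTF-8; the 504-byte figure comes from the text, while the function names are hypothetical and nothing here calls Valentina itself:

```python
# Rough byte budgeting for a VarChar field whose limit is counted in bytes.

VARCHAR_LIMIT_BYTES = 504  # field size mentioned above

def fits_in_field(text, limit_bytes=VARCHAR_LIMIT_BYTES):
    """True if the UTF-8 encoding of text fits within the byte limit."""
    return len(text.encode("utf-8")) <= limit_bytes

def worst_case_bytes(max_chars, bytes_per_char=4):
    """Upper bound on UTF-8 size: no code point needs more than 4 bytes."""
    return max_chars * bytes_per_char

# A 30-character value stays far below the limit even in the worst case.
print(worst_case_bytes(30))          # upper bound for 30 characters
print(fits_in_field("ő" * 30))       # 30 two-byte characters
```

Most accented Latin letters take 2 bytes in UTF-8, so a typical 25- to 30-character value uses only a small fraction of the field either way.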
I can see that the potential of using multi-byte characters in String
fields defined as UTF8 storage could be problematic. I tend to define
very few String fields, though -- pretty much only for situations like
product codes and so forth, where I can be certain of an exact length.
These are situations where multi-byte characters are unlikely to be
needed. I'm supposing that my best plan is probably to keep these
defined as UTF8, but to define the language as ASCII and validate input
to ensure compliance.
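
One way to do that input check on the application side is sketched below. It is an assumption about how such a check might look, not part of Valentina; `is_valid_code` and the sample codes are invented for illustration:

```python
# Validate a fixed-length code before storing it in a String field.
# With ASCII-only input, one character is always one byte in UTF-8,
# so the character count and the byte count agree.

def is_valid_code(code, length):
    """True if code has exactly `length` characters, all of them ASCII."""
    return len(code) == length and code.isascii()

for candidate in ("AB-1234", "ÅB-1234", "AB-12"):
    print(f"{candidate!r}: {is_valid_code(candidate, 7)}")
```

Rejecting non-ASCII input up front keeps the fixed field width meaningful regardless of the storage encoding.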
So, in short, I hope that you continue to support storage in UTF8 for
all string types. In many cases, of course, UTF8 is not the best
solution -- but there certainly are situations in which it makes a
great deal of sense.
Thanks again for all you're doing. Valentina 1.x has been a great
product; 2.x is promising to be truly outstanding!
-- Erik