Valentina 2.0. -- What are your 3 DREAM features?

Sat Jan 31 02:18:26 CST 2004

>>> We will use UTF16 as native encoding.
>> 
>> But we can store the data as UTF-8, right (to keep the size of our
>> db's down)? It will be converted to and from UTF-16 by Valentina for
>> calculations/sorts/etc. on-the-fly, correct?
> 
> I am still not sure, Jon.
> 
> I once saw some problem with UTF-8 storage.
> I do not remember now which one.
> 
> When I come to string-based indexes, we will think about this again.
> 
> Yes, the dream is to store as UTF-8 or some other single-byte encoding,
> IF the developer has asked for this.

If I may add my two cents on this, I would go further: let the users of
the database (us) choose. The choice could apply to the whole database, i.e.
not necessarily at the field level. But as always, the more flexibility the
better...

Here's why:

    UTF-8 takes 1 to 4 bytes per character
    UTF-16 takes 2 or 4 bytes per character
    UTF-32 takes 4 bytes per character
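
To make these sizes concrete, here is a quick sketch. Python and the four
sample characters are my own choice for illustration; nothing here is
Valentina code. It just encodes one character at a time and counts the bytes.

```python
# Sample characters (arbitrary picks): A, e-acute, euro sign, musical G clef.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codecs are used so that no byte-order mark is counted.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for ch in SAMPLES:
    # Prints the code point followed by (UTF-8, UTF-16, UTF-32) byte counts.
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

The last sample is outside the 16-bit range, which is the case where UTF-16
needs 4 bytes.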

Now imagine that you have a field where you allow the user to enter up to 64
characters. Quiz: how much space do you have to reserve for the field in the
database under each of the 3 encodings?

    Bytes to reserve for a 64-character field:

    Encoding            For Americans       For International
    --------            -------------       -----------------
     UTF-8                  64                  256
     UTF-16                 128                 256
     UTF-32                 256                 256

So, as we can see, as soon as we develop software for an international
audience, we need to reserve 4 bytes per character anyway. The only
difference is where the unused bytes end up: at the end of the string or in
the middle of it.
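
The arithmetic behind those figures can be sketched as follows. This is only
an illustration in Python; the names MAX_BYTES_PER_CHAR and field_size are
hypothetical, and it assumes a fixed-size field sized for the worst case.

```python
# Worst-case bytes per character: (ASCII-only text, any Unicode text).
MAX_BYTES_PER_CHAR = {
    "UTF-8": (1, 4),
    "UTF-16": (2, 4),
    "UTF-32": (4, 4),
}

def field_size(max_chars, encoding, international):
    """Bytes to reserve for a fixed-size field holding max_chars characters."""
    ascii_only, any_text = MAX_BYTES_PER_CHAR[encoding]
    return max_chars * (any_text if international else ascii_only)

for name in MAX_BYTES_PER_CHAR:
    # Prints the encoding, then the ASCII-only and international sizes.
    print(name, field_size(64, name, False), field_size(64, name, True))
```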

However, UTF-32 will be faster for all comparisons, since no conversion of
the characters is needed. A 4-byte UTF-8 character is more costly to compare
than a 4-byte UTF-16 character, which in turn is more costly than a UTF-32
character. Unfortunately, most Unicode implementations support only UTF-8
and UTF-16.
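
Here is a small sketch of one reason a 4-byte UTF-16 character needs extra
work when comparing. It does not measure speed; it only checks whether
comparing the raw stored bytes gives the same order as comparing the
characters themselves. Python, the two sample characters and the helper name
sort_by_encoded_bytes are my assumptions for illustration. Big-endian byte
order is assumed so that a plain byte comparison is meaningful.

```python
def sort_by_encoded_bytes(strings, encoding):
    """Sort strings by comparing their raw encoded bytes, without decoding."""
    return sorted(strings, key=lambda s: s.encode(encoding))

a = "\uff21"       # a character near the top of the 16-bit range
b = "\U00010000"   # the first character that needs 4 bytes in UTF-16

by_code_point = sorted([b, a])  # Python compares str values by code point

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    same = sort_by_encoded_bytes([b, a], enc) == by_code_point
    print(enc, "byte order matches character order:", same)
```

With these two characters, only the UTF-16 byte order disagrees with the
character order, so a UTF-16 comparison has to look at the surrogate pair
rather than just the raw bytes.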

So, for today, the ideal is to use UTF-8 for Americans (or French, etc.) and
UTF-16 everywhere else. Once implementations add support for UTF-32, it will
become useful to switch to it. UTF-32 is the only native encoding, in the
sense that every character is one fixed-size unit. UTF-16 used to be like
that too, but that changed lately...

Since it is a new code base, why not build in that flexibility from the
start...

Eric

___________________________________________________________________

 Eric Forget                       Cafederic
 ForgetE at cafederic.com             <http://www.cafederic.com/>

 Fingerprint <86D5 38F5 E1FD 5D9C 71C3  BAA3 797E 70A4 6210 C684>