[V4RB] Re: Unicode workaround via UTF-16 - problems

Dave Addey dave.addey at dsl.pipex.com
Fri Nov 7 10:22:10 CST 2003


Hi Ruslan,

> Dave,
> 
> I think the problem is not in 4-byte chars.
> As I have read, some accented chars (e.g. in German a') can be expanded
> into 2 chars when we do sorting.
> 
> This means that your way of sorting SOMETIMES can give glitches.

Oh, I see what you mean.  Hmm.  Well, I'll give it a go, and see how I get
on - i.e. see if the sorting is good *enough* until v2.0 of Valentina :-)

>> So, I've switched to using VVarBinary rather than VVarChar for fields where
>> I want to store my UTF-16 data.  My problem comes when I try and sort on
>> these fields.
> 
> Oops, it seems I have given you wrong info.
> Binary fields CAN store data.
> 
> But since these are BINARY fields, Valentina DOES NOT index them, and
> therefore cannot sort, because the sort algorithm of Valentina uses the index.
> 
> Hmm, then I am afraid we do not have a way.
> A String field cannot store data with ZERO inside.
> Binary cannot sort...

Drat!  Well, it looks like my only other option then is to encode the UTF-16
strings before I put them in the database.  Let me think about this...

I guess that if I'm being inaccurate about my sorting (and "living" with
this fact for now), then I could just use UTF-8 instead (which doesn't
contain these zero bytes).  UTF-8 encodes every multi-byte character using
only bytes of &h80 or above, so strings which contained these wouldn't sort
properly.  But, if I'm happy to live with sorting where multi-byte chars
appear *after* one-byte chars in all cases, then this isn't a problem.  It
makes my sorting less accurate, but maybe I can live with this for now!
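
To make the UTF-8 idea concrete, here is a minimal sketch in Python (not
REALbasic, and not using any Valentina API).  The helper name utf8_key and
the sample words are made up for illustration.  It checks that the encoded
strings contain no &h00 byte and shows what a plain byte-wise sort does
with an accented character:

```python
def utf8_key(text: str) -> bytes:
    """Hypothetical helper: encode text as UTF-8 for storage and sorting.

    The result contains no &h00 byte unless the text itself contains U+0000.
    """
    return text.encode("utf-8")


words = ["zebra", "apple", "\u00e9clair", "Zoo"]
keys = [utf8_key(w) for w in words]

# None of the encoded strings contains a zero byte.
has_zero_byte = any(b"\x00" in k for k in keys)

# A byte-wise sort puts the accented word after every plain ASCII word,
# and uppercase ASCII before lowercase ASCII.
byte_sorted = sorted(words, key=utf8_key)
```

So the ordering is by byte value rather than by any language's dictionary
rules, which is the inaccuracy described above.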

Alternatively, I could encode my UTF-16 strings so that the zero-byte
character isn't there.  Is it *just* zero-byte (&h00) that isn't allowed in
Valentina strings?  If so, I could replace all these with a Unicode control
character such as &hFF (as I know the string is Unicode, and I created the
string, so the control characters won't appear in the string otherwise.)
Are there any bytes other than &h00 that aren't allowed in VVarChar?
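
A minimal sketch of this substitution, again in Python rather than
REALbasic, with a made-up helper name (hide_zero_bytes) and big-endian
UTF-16 assumed.  It also shows one case worth checking before relying on
the idea: once &h00 has become &hFF, a character whose high byte is
already &hFF can produce the same stored bytes as a plain ASCII character.

```python
def hide_zero_bytes(text: str) -> bytes:
    """Hypothetical helper: big-endian UTF-16 with every &h00 byte
    replaced by &hFF, so the result contains no zero bytes."""
    return text.encode("utf-16-be").replace(b"\x00", b"\xff")


stored = hide_zero_bytes("Dave")
no_zero = b"\x00" not in stored

# "A" is 00 41 in big-endian UTF-16, which becomes FF 41 after the swap.
plain = hide_zero_bytes("A")
# U+FF41 is already FF 41, so it is stored as the same two bytes.
other = hide_zero_bytes("\uff41")
collides = plain == other
```

If the data can never contain characters in the U+FF00 to U+FFFF range,
this collision cannot happen; otherwise the substitution is not reversible.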

Third option: I encode *all* bytes of my UTF-16 string into a longer format
(e.g. replace each byte with its hex equivalent, such that &h00 becomes
"00").  This would double my storage requirements for these strings
(actually, the multiplier is x4 compared with one byte per character,
because UTF-16 already uses two bytes for each), but would definitely work!
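
A minimal sketch of the hex idea in Python (not REALbasic), with made-up
helper names hex_encode and hex_decode and big-endian UTF-16 assumed.  Each
byte becomes two ASCII hex digits, so the stored text has no zero bytes and
is four characters long per original UTF-16 code unit:

```python
def hex_encode(text: str) -> str:
    """Hypothetical helper: big-endian UTF-16 bytes as lowercase hex."""
    return text.encode("utf-16-be").hex()


def hex_decode(stored: str) -> str:
    """Hypothetical helper: reverse hex_encode."""
    return bytes.fromhex(stored).decode("utf-16-be")


stored = hex_encode("Dave")          # "0044006100760065"
round_trip = hex_decode(stored)      # "Dave"

# Lowercase hex digits sort in the same order as the byte values they
# represent, so sorting the hex text matches sorting the raw UTF-16 bytes.
words = ["zebra", "apple", "\u00e9clair"]
by_hex = sorted(words, key=hex_encode)
by_bytes = sorted(words, key=lambda w: w.encode("utf-16-be"))
```

The sort order is still by code unit value, not by language rules, so this
addresses the storage problem rather than the accuracy of the sorting.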

Thanks in advance for your help,

Dave.


