Optimal Performance String Searching with Hashes Re: VChar vs VText

Ed Kleban Ed at Kleban.com
Wed Nov 30 10:38:35 CST 2005




On 11/30/05 9:56 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/30/05 5:08 PM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>> But my whole point is that when you want to find a string fast, you use a
>> hash, not comparisons embedded in some string package -- no matter who wrote
>> it or how well.  In fact even the REALbaic "=" operator is one of the worst
>> possible things to use to do string comparisons.  This is simply the wrong
>> approach from my perspective.
> 
> I think it is good idea add Hash Index for STRING/VarChar fields which are
> marked as Primary Key.

Don't for get "Text" flavor fields stored as Blobs as well.  There's a case
where you REALLY want to use hashes because it saves you from having to page
in or search many very large strings.

> 
> Just IMHO this is bad design make keys on strings. :-)

Because if the indexing mechanism such as hashing is valuable, then the
database facilities should support it so that the user doesn't have to --
that being the whole point of a database?  Yeah, sure I can appreciate and
even support that view.

The problem is, that off the top of my head, I don't know what "add[ing]
HashIndex for STRING/VarChar fields" really means ... or why "[for] fields
which are marked as Primary Key" is relevant since I might well have
multiple string fields in a record that I want to access with hashes -- not
likely, but possible.

hmm... 

Yeah, I guess you could do this easy enough.  It raises a few questions
though..

1) What syntax do you use.  I suppose you could use EVFlags.fIsHashed on a
VarChar, String, or Text field.  As a result a separate parallel hash field
would be created in the magic secret index table land that the user never
sees.

Then FindSingle and the other Find commands just work transparently...  Yep.
Sure.  That would work fine.

The only dicey question is whether there is value in allowing the internal
hash encodings to be exposed or usable as arguments.  I guess I'd say no,
and if it really mattered you could simply implement hashes yourself
manually the way I am now.

There is one likely hitch however.  This will only work for certain
collation attributes unless you always cast a string into some normalized
form before performing the hash.   So in the case of kStrength, kPrimary for
example, you'd have to cast the string to a version with no accents and all
lowercase before calculating the hash.  But yeah, I guess that's readily
doable as well.

Ok you've convinced me.  Or we've convinced me.  This should indeed be a
feature added to V2, and can offer dramatic speed improvements to all.

I'll (eventually) submit a feature request to Manits.

 




More information about the Valentina mailing list