Optimal Performance String Searching with Hashes Re: VChar vs VText

Ed Kleban Ed at Kleban.com
Wed Nov 30 09:08:26 CST 2005




On 11/28/05 6:59 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/28/05 4:24 AM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>>> But you cannot do range searches using hashing, right ?
>> 
>> Yes, that is correct.
>> 
>> Question: What applications do you have in mind that range searches on
>> strings would be useful for?
> 
> Strings have alphabetical order. So we must not prevent developers from
> using range searches.

Yes, of course.  The use of hashing I'm talking about is for a completely
different purpose.  You certainly still need ordered string comparison,
searching, and range operations.  I'd never suggest otherwise.

> But you are right. It is a good idea to add one more option to choose:
>     a hashing kind of index.
> 
> You can add this as a request in Mantis.

Hmm, I'm not sure what you're suggesting I submit to Mantis.  I'm able to do
hashed string searches just fine right now simply by storing a parallel hash
field in any record that has a string, text, or varchar field.  FindSingle et
al. do a perfectly fine, fast job of doing lookups on them.
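
(For the curious, here's the idea in a nutshell as a minimal C sketch rather
than actual Valentina or RB code -- the record layout and function name are
made up purely for illustration: compute the hash once when the string is
stored, keep it in an integer field beside the string, then at lookup time
hash the needle, match on the integer, and confirm with an exact string
compare to weed out collisions.)

    #include <stdint.h>
    #include <string.h>

    /* Toy record: the string value plus its precomputed hash, side by side. */
    typedef struct {
        uint32_t    hash;   /* integer "hash field" stored next to the string */
        const char *text;   /* the string, text, or varchar value itself      */
    } Record;

    /* Find a needle: compare the cheap integer hash first, and only fall back
       to a full string compare when the hashes match, to rule out collisions. */
    const Record *find_by_hash(const Record *table, size_t count,
                               const char *needle, uint32_t needle_hash)
    {
        for (size_t i = 0; i < count; ++i) {
            if (table[i].hash == needle_hash &&
                strcmp(table[i].text, needle) == 0)
                return &table[i];
        }
        return NULL;
    }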

Amusingly, the hardest part has been finding a fast C-coded hashing function
to use.  I was delighted to see that in RB 2005 they had included a fast
hashing function that sounded exactly like what I needed.  I then discovered
that, counter to the documentation, the hash was not unique for Objects, and
hasn't been since 2003.  They're going to fix that in the next beta release.
I then discovered that -- d'oh! -- the hash is case-insensitive, so "text" and
"TEXT" have the same hash.  For the moment that's fine -- I just get more
collisions.  I think Bjorn is going to expose the internal hash method used
for the Dictionary inside of the Einhugur tools for us to use.

But I could certainly submit a request to Mantis for Valentina to provide a
hash function -- I'd even be delighted to supply you with the recommended
one line of C code I use for the hash... probably just a 5-bit rotate and
XOR for each byte.
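
Something along these lines, as a minimal sketch (the exact rotation width and
the function name here are illustrative, not the literal code I'd send):

    #include <stdint.h>
    #include <stddef.h>

    /* Rotate-and-XOR hash: for each byte, rotate the running 32-bit hash left
       by 5 bits and XOR in the byte -- essentially one line of work per char. */
    uint32_t rotxor_hash(const unsigned char *s, size_t len)
    {
        uint32_t h = 0;
        for (size_t i = 0; i < len; ++i)
            h = ((h << 5) | (h >> 27)) ^ s[i];  /* 5-bit left rotate, then XOR */
        return h;
    }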

Is there some other aspect of incorporating hashes into indices somehow that
you're recommending I submit to Mantis?
 
>   
>> For a table of strings however, it's a whole 'nuther story.  First you have
>> the overhead of comparing strings, which is much, much slower than comparing
>> integers.  On top of that, as of V2 we now have the reality that strings are
>> stored in tables in Unicode and are therefore twice the size -- which
>> in reality does not imply twice the time, since most matches will fail on a
>> comparison of the first few characters.  But the really scary part that I
>> have no clue about is that there are so many parameters you can choose from
>> for the CollationAttribute, plus the fact that you're calling into some
>> foreign package to perform those comparisons.  I have no clue what extra
>> overhead burden there is for making these comparisons, but I'm presuming the
>> answer is "at least a little bit".
> 
> Well, we use the IBM library. I believe that the IBM guys have done their
> job perfectly.

I'm delighted to hear it.  Your approval of their code means a great deal
to me.

> And we have no other choice. You cannot have your OWN string compare method
> for each natural language.
> 

Sure.

But my whole point is that when you want to find a string fast, you use a
hash, not comparisons embedded in some string package -- no matter who wrote
it or how well.  In fact, even the REALbasic "=" operator is one of the worst
possible things to use for string comparisons.  This is simply the wrong
approach from my perspective.




