Optimal Performance String Searching with Hashes Re: VChar vs VText
Ed Kleban
Ed at Kleban.com
Wed Nov 30 09:08:26 CST 2005
On 11/28/05 6:59 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:
> On 11/28/05 4:24 AM, "Ed Kleban" <Ed at Kleban.com> wrote:
>
>>> But you cannot do range searches using hashing, right?
>>
>> Yes, that is correct.
>>
>> Question: What applications do you have in mind that range searches on
>> strings would be useful for?
>
> Strings have alphabetical order, so we must not prohibit developers from
> using range searches.
Yes, of course. The use of hashing I'm talking about is for a completely
different purpose. You certainly still need ordered string comparison,
searching, and range operations. I'd never suggest otherwise.
> But you are right. It is a good idea to add one more option to choose: a
> hashing kind of index.
>
> You can add this as a request into Mantis.
Hmm, I'm not sure what you're suggesting I submit to Mantis. I can do hashed
string searches just fine right now by simply storing a parallel hash field
in any record that has a string, text, or varchar field. FindSingle et al.
do a perfectly fine, fast job of looking them up.
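Just to make the idea concrete (the struct and linear scan below are only a
stand-in for a table with a string field plus a parallel hash field -- they
are not Valentina API calls, just a sketch of the scheme):

    #include <stdint.h>
    #include <string.h>

    /* Each record carries its string plus a precomputed 32-bit hash. */
    typedef struct {
        uint32_t    hash;   /* parallel hash field used for the lookup */
        const char *text;   /* the string/text/varchar field itself    */
    } Record;

    /* Find by hash first, then confirm with an exact compare so the
       occasional collision is weeded out. */
    static const Record *find_by_hash(const Record *table, size_t count,
                                      uint32_t h, const char *key)
    {
        for (size_t i = 0; i < count; i++)
            if (table[i].hash == h && strcmp(table[i].text, key) == 0)
                return &table[i];
        return NULL;
    }

In Valentina the scan is of course replaced by FindSingle against the hash
field; only the final confirmation ever touches the string itself.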
Amusingly, the hardest part has been finding a fast C-coded hashing function
to use. I was delighted to see that RB 2005 included a fast hashing function
that sounded exactly like what I needed. I then discovered that, counter to
the documentation, the hash has not been unique for Objects since 2003.
They're going to fix that in the next beta release. Then I discovered that,
d'oh, the hash is case-insensitive, so "text" and "TEXT" have the same hash.
For the moment that's fine -- I just get more collisions. I think Bjorn is
going to expose the internal hash method used for the Dictionary inside the
Einhugur tools for us to use.
But I could certainly submit a request to Mantis for Valentina to provide a
hash function -- I'd even be delighted to supply you with the recommended
one line of C code I use for the hash... probably just a 5-bit rotate and
XOR for each byte.
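That one-liner looks roughly like this (a sketch of what I mean, not the
exact code -- the rotate amount and the unsigned types are illustrative
choices):

    #include <stdint.h>
    #include <stddef.h>

    /* Rotate the running hash left 5 bits, then XOR in the next byte. */
    static uint32_t rotxor_hash(const char *s, size_t len)
    {
        uint32_t h = 0;
        for (size_t i = 0; i < len; i++)
            h = ((h << 5) | (h >> 27)) ^ (unsigned char)s[i];
        return h;
    }

A case-insensitive hash like the RB one presumably just folds each byte to
lower case before the XOR, which would be why "text" and "TEXT" end up with
the same value.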
Is there some other aspect of incorporating hashes into indices somehow that
you're recommending I submit to Mantis?
>
>> For a table of strings, however, it's a whole 'nuther story. First you have
>> the overhead of comparing strings, which is much, much slower than comparing
>> integers. On top of that, as of V2 we now have the reality that strings are
>> stored in tables as Unicode and are therefore twice the size -- which in
>> practice does not imply twice the time, since most matches will fail on a
>> comparison of the first few characters. But the really scary part, which I
>> have no clue about, is that there are so many parameters you can choose from
>> for the CollationAttribute, plus the fact that you're calling into some
>> foreign package to perform those comparisons. I have no idea what extra
>> overhead burden those comparisons carry, but I'm presuming the answer is
>> "at least a little bit".
>
> Well, we use the IBM library. I believe that the IBM guys have done their
> job perfectly.
I'm delighted to hear it. Your approval of their code means a great deal
to me.
> And we have no other choice. You cannot have your OWN string compare method
> for each natural language.
>
Sure.
But my whole point is that when you want to find a string fast, you use a
hash, not comparisons embedded in some string package -- no matter who wrote
it or how well. In fact, even the REALbasic "=" operator is one of the worst
possible things to use for string comparisons. This is simply the wrong
approach from my perspective.
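In code terms, the exact-match test I'm talking about is roughly this
(names are illustrative, and ordered/collated comparison is still needed
elsewhere -- this only covers the "find this exact string fast" case):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* With precomputed hashes, almost every non-match is rejected by a
       single integer compare; the byte compare runs only on a collision. */
    static bool same_string(uint32_t ha, const char *a, size_t alen,
                            uint32_t hb, const char *b, size_t blen)
    {
        if (ha != hb || alen != blen)
            return false;
        return memcmp(a, b, alen) == 0;
    }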