Optimal Performance String Searching with Hashes Re: VChar vs VText

Ed Kleban Ed at Kleban.com
Sun Nov 27 20:24:11 CST 2005




On 11/26/05 1:31 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/26/05 5:35 AM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>> Either way, I'm sure glad I'm doing my searches on string hashes rather than
>> actual strings.
> 
> Hashing can work only for search fld = "xxx"

If I understand you correctly, then yes.

> 
> But you cannot do range searches using hashing, right ?

Yes, that is correct.

You mentioned this also the last time I mentioned using Hashes, and both
time have essentially implied that there is some drawback to not having an
ability to do > or >= or < or <= searches.  I suppose that for some
applications that may be true, but it is virtually never a drawback for my
applications.  In fact I've never even considered wanting to do such a
thing.

Question: What applications do you have in mind that range searches on
strings would be useful for?



For 99% of my applications I want to be able to:

1) See if I already have a given string (or object, or structure, or
whatever) in my table.

2) If I do, then find that item, or lack thereof, as fast as possible.

3) And then typically add the item into the table if it is not already
there.

I therefore am doing this on a column of say strings for example, which I
always know will be unique.

For a table of bytes, shorts, or longs, simply performing a binary search by
using a FindSingle on an indexed unique column in Valentina is going to be
pretty darn fast.

For a table of strings however, it's a whole 'nuther story.  First you have
the overhead for comparing strings which is much, much slower than comparing
integers,  On top of that as of V2 we now have the reality that strings are
stored in tables in Unicode and are therefore twice the size, -- but which
in reality does not imply twice the time since most matches will fail on a
comparison of the first few characters.   But the really scary part that I
have no clue about is that there are so many parameters you can choose from
for the CollationAttribute, plus the fact that you're calling into some
foreign package to perform those comparisons.  I have no clue what extra
overhead burden there is for making these comparisons, but I'm presuming the
answer is "at least a little bit".

So the simple way is to include a  StringHash as Vlong (NOT as VULong)
column in the table beside your VarChar( N ) column in which you store a
hash encoding.  RealBasic now provides a very fast has for objects
(essentially the address I'm presuming) and strings (position-dependent
shift and Xor of string chars it appears) as well as all other types.  You
simply store the value (or cast it if it is an instantiated Object) in a
variant and request theVariant.hash to get RB's hash value.  When you store
a string the VarChar field of a record, you also store the hash of the
string in the VLong hash field.

To perform a fast lookup on any string, you then hash the string, do a
FindSingle call on the hashfield. If it finds none, you know it's not in the
table.  If it finds one then you still have to compare the strings to ensure
you don't have a hash collision.  If it matches, you've found it.   In the
exceedingly rare cases that it does not match, then you need to do a second
Find search to obtain an array of all hash matches, and if there is more
than one, compare full strings against all except the one you alredy
checked.

The end result is that you essentially do a binary search on a non-unique,
indexed VULong field instead of a binary search on full strings, plus a
typical worst case of a single full string comparison to verify that the
match you found is indeed the actual string.   I haven't done any metering
yet to be able to tell you how much faster this is but I'm guessing the
answer is, "quite a bit" -- especially if you are writing an application
such as mine where you are continually doing thousands of string lookups.

So this is how I use Valentina to do exact searches for strings.
If I haven't bored you to tears with this then you can continue reading my
next wish-list email in which I describe how I'd like to do inexact searches
for strings with genetic encodings rather than hashes...




More information about the Valentina mailing list