Genetic string searches Re: VChar vs VText

Ed Kleban Ed at Kleban.com
Sun Nov 27 21:08:48 CST 2005




On 11/26/05 1:31 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/26/05 5:35 AM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>> Either way, I'm sure glad I'm doing my searches on string hashes rather than
>> actual strings.
> 
> Hashing can work only for search fld = "xxx"
> 
> But you cannot do range searches using hashing, right ?

Correct.  But now I'd like elaborate on an earlier note I made about a
feature I'd like to see:

Of all the features on my wish-list for future Valentina versions, top of
the list would be some VField.FindMasked functions; perhaps something like:

VField.FindMasked( 
    inMask as Integer, inKeep as Integer, inSelection as VSet )
    as VBitSet

VField.FindMaskedAsArraySet(
    inMask as Integer, inKeep as Integer, inSelection as VSet  )
    as VArraySet

VField.FindFirstMasked(
    inMask as Integer, inKeep as Integer, inSelection as VSet )
     as Integer // Returns RecID

High-order bits of the inMask argument would be truncated to the size of the
VField.  The inKeep argument would be one of:

EVKeep.includeExactMatches
EVKeep.excludeExactMatches
EVKeep.includePartialMatches
EVKeep.excludePartialMatches

The functions would simply perform a BitAnd between the field value and
inMask.  If the result was equal to inMask it would qualify as an exact
match. If the result was non-zero it would qualify as a partial match.

This would allow you to do some very sophisticated match processing that
could greatly improve the performance of many algorithms.  Let's take an
example where we want to perform a very rapid search among a collection of
arbitrary strings that have been "interned" (stored uniquely) in a column.
Perhaps we already have a hash field as well for performing rapid
exact-match lookups.  But what if we want to rapidly find all words that
include the letters "bet" and therefore want to match "Alphabet", "Better",
and the phrase "You Betcha" -- all of which are included in the collection.

One could certainly use the regular expression matching facility, and
eventually come up with a list of candidates.  But it would take an awful
lot of processing to come up with that list.  A vastly faster way would be
to include a genetic encoding of every interned string as another field.

This would likely be an integer field, probably VULong for optimal
flexibility and performance.  A simplistic encoding might be to simply
lowercase the word and then allocate 26 of the 32 bits in the VULong for
each of the characters of the alphabet.  Thus when we call FindMasked our
inMask value would have bits set for the "b", "e", and "t" characters.

The list we get back from a such FindMasked search would have some good
candidates like "better" and some bad candidates in the list like "tabbed"
which also has "b", "e", and "t" in its genetic encoding.   We'd still have
to run the result through a regular expression search or examine them in
some other manner to cull out the bad candidates, but the list to cull would
be a small fraction of the table contents.  With a more sophisticated
encoding that included aspects of string length, presences of whitespace and
usage of punctuatation the list retuned by FindMasked could be an
EXCEEDINGLY small fraction.  One could even include a second VULong field
with a different genetic encoding to perform a second-pass culling using
another application of a FindMasked function call.

What ya think?

 




More information about the Valentina mailing list