IndexbyWords

Ed Kleban Ed at Kleban.com
Wed Nov 16 11:30:57 CST 2005




On 11/15/05 2:18 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> 
>>> Well, it is possible to check all this using V4RB example
>>>     
>>>     Common/SplitToWords
>>>  

Great.  I see that this app works by using the method
    Vstring.SplitToWords( text as string)
Which does not appear to be documented in any of the manuals.
Is this a supported method we can rely on and use?

I also note in the field creation the following comment:
  
  // To be able to use function SplitToWords() the field must be indexed.
  fld = tbl.CreateStringField( "fldString", 40, _
      EVFlag.fIndexed + EVFlag.fIndexByWords )
  
But I thought that fIndexed was meaningless when fIndexByWords
was used.  Does that mean that fIndexed is not needed here?
Or is SplitToWords somehow special?

Unfortunately it turns out that "words" as defined by SplitToWords include
"." and exclude "_".  The latter is unfortunate for my application.  The
former seems unfortunate for almost every application I can think of
that might otherwise want to use IndexByWords.  It's not good for prose,
verse, or code, since "foo" and "foo." and ".foo" would all parse out as
different search terms.  That is, unless a search for "foo" would return
entries that included all three of these (and other) combinations.  I'll
have to go back and read the fine print on queries for IndexByWords fields.
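
To make the concern concrete, here is a minimal Python sketch of the
splitting behavior described above.  The pattern is only a stand-in that
treats "." as part of a word and "_" as a separator; it is an assumption
for illustration, not Valentina's or ICU's actual rule, and
split_to_words is a hypothetical helper, not the V4RB method.

```python
import re

# Assumption: stand-in for the observed word definition
# ("." belongs to a word, "_" does not). Not the real ICU rule.
OBSERVED_WORD = re.compile(r"[A-Za-z0-9.]+")

def split_to_words(text):
    """Return the tokens the assumed word pattern finds in text."""
    return OBSERVED_WORD.findall(text)

# "foo", "foo." and ".foo" come out as three distinct terms,
# while "my_var" is cut in two at the underscore.
terms = split_to_words("foo foo. .foo my_var")
print(terms)
```

Under this assumed rule an exact-match lookup for "foo" would miss the
records that only contain "foo." or ".foo".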
 
>> Is there any way to change this so that for
>> example IndexByWords includes "_" as a character, or to make SQL calls
>> directly to do an indexByWords on a field with the scan characteristics you
>> desire?
> 
> ICU allows this, but it is an advanced technique. We have not studied it
> deeply, and moreover have not yet exposed this feature in the plugin API.
> 


So... I would like to request, as a future enhancement, some control
through the API over how Valentina uses ICU.  Perhaps allow an argument
with a list of separators, or the REGEX that gets used with ICU to
determine what is a word.
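
As a rough sketch of what such a request could look like, the Python
below builds a word index from a caller-supplied word pattern.  The
function name build_word_index, its word_pattern argument, and the
lowercasing of terms are all hypothetical choices for illustration;
nothing here reflects an existing Valentina or ICU interface.

```python
import re
from collections import defaultdict

def build_word_index(records, word_pattern=r"\w+"):
    """Map each lowercased word to the set of record IDs containing it.

    word_pattern is the caller's definition of a word (hypothetical
    parameter). The default \\w+ keeps "_" and drops ".".
    """
    word_re = re.compile(word_pattern)
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in word_re.findall(text):
            index[word.lower()].add(rec_id)
    return index

# Made-up sample records keyed by record ID.
records = {1: "call foo_bar. then stop", 2: "Foo. bar", 3: "foo_bar"}
index = build_word_index(records)
print(sorted(index["foo_bar"]))  # IDs of records containing foo_bar
print(sorted(index["foo"]))      # IDs of records containing foo
```

Because the words are split once at indexing time, a lookup is a single
dictionary access rather than a scan of every record.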

> I think SQL REGEX can do most things you could wish for
>
>> Yeah, I'm sure it probably can.  But SQL REGEX is going to run a dynamic
>> search upon demand looking at all records, not offer me a pre-indexed
>> solution that will be able to provide an almost instant response.
> 
> right
>  
>> Is there a way to get a list of all the words that are in the index?
> 
> No. Not in V4RB.
> 
> Hey, Ed! You ask for such advanced features!!!
> Maybe you need to go down to the C++ level?
> 

No thank you.  That's what we've got you for  ;-)

> In C++ we have IndexIterator, which allows iterating over unique indexed values.
> 

Ah.  This is actually what I was asking about the other day, but didn't
convey too well.  I was curious if there was a way to directly access the
Index tables that V4RB creates.  Take, for example, a table of hashes with
an index on the hash field.  Direct access would let you find out not only
whether a record with a given hash exists but also, when several records
share that hash, the recID within the INDEX TABLE of the index entry that
refers to the record found.  Thus in the case of a hash collision you could
retrieve the record referred to by the Next entry of the index table, rather
than having to do a query that returns all records sharing the same hash.
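
In case it helps, here is the idea as a small Python sketch.  The index
table is modeled as a sorted list of (hash, recID) pairs, which is an
assumption about the general shape of such an index and not a
description of Valentina's internals; index_table and iter_rec_ids are
hypothetical names.

```python
from bisect import bisect_left

# Hypothetical index table: (hash, recID) pairs kept sorted by hash.
index_table = sorted([(17, 4), (42, 1), (42, 7), (42, 9), (99, 2)])

def iter_rec_ids(table, wanted_hash):
    """Yield recIDs one at a time from index entries matching wanted_hash."""
    # (wanted_hash,) sorts before every (wanted_hash, recID) pair,
    # so this lands on the first matching entry, if there is one.
    pos = bisect_left(table, (wanted_hash,))
    while pos < len(table) and table[pos][0] == wanted_hash:
        yield table[pos][1]
        pos += 1

candidates = iter_rec_ids(index_table, 42)
first = next(candidates)    # check this record first
second = next(candidates)   # on a collision, step to the Next entry
print(first, second)
```

The generator only advances when asked, so the caller can stop at the
first record that really matches instead of fetching every record with
the same hash.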

That make sense?   

Is the above possible in C++?

Any chance you'll ever add this interface to the RB API?


Thanks as always!
--Ed



