DRAFT of specification of Indexing By Words for strings.

Ruslan Zasukhin sunshine at public.kherson.ua
Thu Sep 23 07:19:20 CDT 2004


On 9/23/04 5:33 AM, "Erik Mueller-Harder"
<valentina-list at vermontsoftworks.com> wrote:

> On Sep 22, 2004, at 16:10, Ruslan Zasukhin wrote:
> 
>> IBM ICU has a class, BreakIterator.
>> 
>> It makes sure that for the specified language we will get the correct words.
> 
> Excellent; I'm sure that will take care of at least 95% of our needs.
> 
> I do wonder if there could be some way to override it, though.  I could
> see an override being useful, for example, in the case of *last names*
> (here we go again!).  A hyphen character ("-") is almost always
> considered a word boundary.  Occasionally, however, there have been
> times when I'd have liked to be able to treat hyphenated words as
> one-word sets in an indexed-by-word field (last-name fields come to
> mind!).
> 
> Perhaps you could provide an interface similar to what we've been
> discussing re prohibited index words, letting us add characters to
> word-break sets and to non-word-break sets as needed?

I see, Erik

Very soon we will start work on the index-by-words implementation,
so we will see whether ICU allows this kind of tuning.
I think it does.

If it does, then yes, we must be able to provide some API to control this.
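
To make the idea of such a control API concrete, here is a minimal sketch in Python. It is not ICU and not Valentina code: the function name split_words and its two parameters are hypothetical, and the default rule (a character belongs to a word if it is alphanumeric) is a simplifying assumption standing in for a real BreakIterator.

```python
def split_words(text, extra_word_chars="", extra_break_chars=""):
    """Split text into words with user-adjustable character sets.

    Hypothetical illustration only:
    - extra_word_chars: characters forced to count as part of a word
    - extra_break_chars: characters forced to act as word breaks
    Everything else falls back to str.isalnum().
    """
    words = []
    current = []
    for ch in text:
        if ch in extra_break_chars:
            is_word_char = False      # user override: always a break
        elif ch in extra_word_chars:
            is_word_char = True       # user override: never a break
        else:
            is_word_char = ch.isalnum()
        if is_word_char:
            current.append(ch)
        elif current:
            # A break character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Default behaviour: the hyphen is a word boundary.
print(split_words("Mueller-Harder, Erik"))
# Last-name field: treat the hyphen as part of the word.
print(split_words("Mueller-Harder, Erik", extra_word_chars="-"))
```

With the override, a hyphenated last name is indexed as one word. Note that in this sketch a free-standing hyphen would then also be collected as a word, so a real API would need a rule for that case.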


> Also, there may be situations where a field defined as, for example,
> English may contain words from other languages, in which case the IBM
> ICU might cause the word "rechêrche" (for example) to be indexed as
> "rech" and "rche," as (I believe) the Valentina 1.x kernel does
> (certainly it does this with many "accent" characters).  If we could
> have the capability of adding "ê" (in this case) to a set of word-break
> characters, we could fine-tune the ICU as appropriate for our specific
> applications.
> 
> Do others see a need for this, too, or is this my own pipe-dream?

I believe that if you mix languages, you simply MUST use Unicode.
And then ICU should be able to recognize the words.
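
The difference can be illustrated with a small Python sketch. It uses the standard re module as a stand-in, not ICU's BreakIterator, and the helper names words_ascii and words_unicode are made up for this example. The ASCII-only rule splits the word at the accented letter, as Erik describes, while the Unicode-aware rule keeps it whole.

```python
import re


def words_ascii(text):
    # \w restricted to [A-Za-z0-9_]: an accented letter acts as a break.
    return re.findall(r"\w+", text, flags=re.ASCII)


def words_unicode(text):
    # For str patterns, \w is Unicode-aware by default in Python 3.
    return re.findall(r"\w+", text)


# "\u00ea" is the precomposed letter e with circumflex.
sample = "rech\u00earche"
print(words_ascii(sample))    # the word is cut in two around the accent
print(words_unicode(sample))  # the word stays in one piece
```

This only shows that the word rule must know about non-ASCII letters; a decomposed accent (base letter plus combining mark) would need normalization first, which is one reason to rely on ICU rather than a hand-written rule.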

-- 
Best regards,
Ruslan Zasukhin      [ I feel the need...the need for speed ]
-------------------------------------------------------------
e-mail: ruslan at paradigmasoft.com
web: http://www.paradigmasoft.com

To subscribe to the Valentina mail list go to:
http://lists.macserve.net/mailman/listinfo/valentina
-------------------------------------------------------------


