DRAFT of specification of Indexing By Words for strings.

Wed Sep 22 22:33:41 CDT 2004

On Sep 22, 2004, at 16:10, Ruslan Zasukhin wrote:

> IBM ICU has a BreakIterator class.
>
> It makes sure that for a specified language we get the correct words.

Excellent; I'm sure that will take care of at least 95% of our needs.

I do wonder if there could be some way to override it, though.  I could 
see an override being useful, for example, in the case of *last names* 
(here we go again!).  A hyphen character ("-") is almost always 
considered a word boundary.  Occasionally, however, there have been 
times when I'd have liked to be able to treat hyphenated words as 
one-word sets in an indexed-by-word field (last-name fields come to 
mind!).

Perhaps you could provide an interface similar to what we've been 
discussing re prohibited index words, letting us add characters to 
word-break sets and to non-word-break sets as needed?
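
To make the idea concrete, here is a rough sketch in Python of the kind 
of override I have in mind.  This is not ICU and not anything Valentina 
offers today: the function split_words and its two parameters, 
extra_word_chars and extra_break_chars, are names I made up purely for 
illustration, and the default rule (letters and digits are word 
characters) is a simplifying assumption.

```python
def split_words(text, extra_word_chars="", extra_break_chars=""):
    """Split text into words.

    Hypothetical sketch: by default a character belongs to a word if it is
    a letter or digit. The two override sets let the caller force specific
    characters to be word characters or word breaks.
    """
    words, current = [], []
    for ch in text:
        if ch in extra_break_chars:
            is_word_char = False          # forced word break
        elif ch in extra_word_chars:
            is_word_char = True           # forced word character
        else:
            is_word_char = ch.isalnum()   # assumed default rule
        if is_word_char:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Default behavior: the hyphen breaks the last name in two.
default_words = split_words("Mary Smith-Jones")

# Last-name field: treat the hyphen as part of the word.
last_name_words = split_words("Mary Smith-Jones", extra_word_chars="-")
```

With the defaults, "Smith-Jones" is indexed as "Smith" and "Jones"; with 
the hyphen added to the non-word-break set, it stays one word.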

Also, there may be situations where a field defined as, for example, 
English contains words from other languages, in which case ICU might 
cause the word "rechêrche" (for example) to be indexed as "rech" and 
"rche", as (I believe) the Valentina 1.x kernel does (certainly it does 
this with many "accent" characters).  If we could add "ê" (in this 
case) to the set of non-word-break characters, we could fine-tune the 
ICU behavior as appropriate for our specific applications.
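
A small Python sketch of the accent case, again only as an illustration.  
It assumes a deliberately naive breaker that knows only ASCII letters 
and digits; I am not claiming that ICU or the 1.x kernel works this way 
internally.  The function ascii_words and its extra_word_chars 
parameter are hypothetical names.

```python
import re


def ascii_words(text, extra_word_chars=""):
    """Return the words in text.

    Assumption: only ASCII letters and digits are word characters unless
    the caller adds more through extra_word_chars (a hypothetical knob).
    """
    pattern = "[A-Za-z0-9" + re.escape(extra_word_chars) + "]+"
    return re.findall(pattern, text)


# The accented letter is treated as a break, so the word is cut in two.
naive = ascii_words("rechêrche")

# Declaring "ê" a word character keeps the word whole.
tuned = ascii_words("rechêrche", extra_word_chars="ê")
```

Here naive comes out as "rech" and "rche", while tuned keeps 
"rechêrche" as a single index entry.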

Do others see a need for this, too, or is this my own pipe-dream?

Cheers,

-- Erik