DRAFT of specification of Indexing By words for strings.
Erik Mueller-Harder
valentina-list at vermontsoftworks.com
Wed Sep 22 22:33:41 CDT 2004
On Sep 22, 2004, at 16:10, Ruslan Zasukhin wrote:
> IBM ICU has a class, BreakIterator.
>
> It makes sure that for a specified language we will get the correct words.
Excellent; I'm sure that will take care of at least 95% of our needs.
I do wonder if there could be some way to override it, though. I could
see an override being useful, for example, in the case of *last names*
(here we go again!). A hyphen character ("-") is almost always
considered a word boundary. Occasionally, however, there have been
times when I'd have liked to be able to treat hyphenated words as
one-word sets in an indexed-by-word field (last-name fields come to
mind!).
Perhaps you could provide an interface similar to what we've been
discussing re prohibited index words, letting us add characters to
word-break sets and to non-word-break sets as needed?
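To make the override idea concrete, here is a rough sketch in Python. It is not Valentina's API and it does not use ICU; the function name split_words and the parameters extra_break and extra_non_break are hypothetical, chosen only to illustrate the two character sets. The default rule (letters and digits are word characters, everything else is a boundary) is an assumption standing in for whatever the real break iterator decides.

```python
def split_words(text, extra_break=frozenset(), extra_non_break=frozenset()):
    """Split text into words, honoring caller-supplied override sets.

    Hypothetical illustration only: the default rule treats letters and
    digits as word characters and everything else as a word boundary.
    Characters in extra_non_break are always kept inside words;
    characters in extra_break always end a word. If a character appears
    in both sets, extra_non_break wins.
    """
    words = []
    current = []
    for ch in text:
        if ch in extra_non_break:
            is_word_char = True
        elif ch in extra_break:
            is_word_char = False
        else:
            is_word_char = ch.isalnum()

        if is_word_char:
            current.append(ch)
        elif current:
            # Boundary reached: flush the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Default behavior: the hyphen is a word boundary.
default_words = split_words("Mueller-Harder, Erik")

# Override for a last-name field: keep hyphenated names together.
name_words = split_words("Mueller-Harder, Erik", extra_non_break={"-"})
```

With the override, the last-name field would index "Mueller-Harder" as one word instead of two. A real implementation would probably also want to handle a stray hyphen that is not between letters, which this sketch does not attempt.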
Also, there may be situations where a field defined as, for example,
English may contain words from other languages, in which case the IBM
ICU might cause the word "rechêrche" (for example) to be indexed as
"rech" and "rche," as (I believe) the Valentina 1.x kernel does
(certainly it does this with many "accent" characters). If we could
add "ê" (in this case) to a set of non-word-break characters, we could
fine-tune the ICU's behavior as appropriate for our specific
applications.
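The kind of split described here is what a tokenizer produces when it treats only unaccented ASCII letters as word characters. The Python sketch below reproduces that effect and contrasts it with a Unicode-aware rule. It is an illustration under assumptions, not a description of how the Valentina kernel or ICU actually works; the function names ascii_words and unicode_words are made up for the example.

```python
import re

# Only unaccented ASCII letters and digits count as word characters.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")

# In Python 3, \w on a str pattern is Unicode-aware, so accented
# letters such as "ê" count as word characters (as does "_").
UNICODE_WORD = re.compile(r"\w+")


def ascii_words(text):
    """Split using the ASCII-only rule; accented letters act as breaks."""
    return ASCII_WORD.findall(text)


def unicode_words(text):
    """Split using the Unicode-aware rule; accented letters stay in words."""
    return UNICODE_WORD.findall(text)


sample = "rech\u00earche"  # "rechêrche", with a precomposed "ê"

split_result = ascii_words(sample)    # the accented letter breaks the word
whole_result = unicode_words(sample)  # the word stays in one piece
```

Note that the sample uses the precomposed character U+00EA. If the text stored "e" followed by a combining circumflex instead, the second rule would also split the word unless the text were normalized first (for example with unicodedata.normalize("NFC", text)).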
Do others see a need for this, too, or is this my own pipe-dream?
Cheers,
-- Erik