[icu-support] Index by words and apostrophes

Ruslan Zasukhin sunshine at public.kherson.ua
Tue Dec 5 09:18:07 CST 2006


On 12/5/06 12:38 AM, "Robert Brenstein" <rjb at robelko.com> wrote:

>> The OP also gives an example in English: searching for "lion" as a whole
>> word in "The lion's pride" fails, too. I just tried this in MS Word
>> 2003, and the search succeeds.
>> 
>> This could be trickier than "l'orgueil" since in the latter case, the
>> "l" really is a separate word, whereas in the former case, the "s" is
>> not.
>> 
>> Perhaps this means that the ICU word break iterator is not a good tool
>> to use when doing a search with the "whole word" option selected.
>> 
>> Cheers
>> 
>> - rick
> 
> I wonder whether the search in Word will pick lioness when searching
> for lion and whether it finds can't as a word.

The following reply was posted on the ICU list yesterday:

-------------------------------------------------------------
Looking at the "The lion's pride" example in MS Word, its behavior is
a little more complex than can be had with a simple break iterator.
Both "lion" and "lion's" match with the whole word option, but a space
in the string disables the whole word search option.  Something at
least a little customized will be needed to implement this sort of
thing.

Maybe find words with ICU, and then also split the found words on
apostrophe if any occur, then match either the parts or whole.  You
could modify the ICU word rules to give a unique status value for
words containing apostrophes, which would avoid the overhead of
rescanning each word.
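
The "find words, then split on apostrophe" idea above can be sketched roughly
as follows. This is only an illustration under stated assumptions: a regular
expression stands in for the ICU word break iterator (it is not ICU and does
not follow the full ICU word rules), and the names WORD_RE, iter_words and
whole_word_matches are hypothetical, not part of ICU or Valentina.

```python
import re

# ASSUMPTION: this regex is a stand-in for the ICU word break iterator.
# A "word" here is a run of letters/digits, optionally joined by
# apostrophes (ASCII ' or U+2019), e.g. "lion's" or "l'orgueil".
WORD_RE = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")
APOSTROPHE_RE = re.compile(r"['\u2019]")


def iter_words(text):
    """Yield (start_offset, word) for each word found in text."""
    for m in WORD_RE.finditer(text):
        yield m.start(), m.group()


def whole_word_matches(text, query):
    """Return (offset, matched_text) pairs where query equals a whole
    word, or one of the apostrophe-separated parts of a word."""
    q = query.casefold()
    hits = []
    for start, word in iter_words(text):
        if word.casefold() == q:
            # The whole word matches, e.g. "lion's" == "lion's".
            hits.append((start, word))
            continue
        # Otherwise try each part between apostrophes.
        offset = start
        for part in APOSTROPHE_RE.split(word):
            if part.casefold() == q:
                hits.append((offset, part))
            offset += len(part) + 1  # +1 skips the apostrophe itself
    return hits


print(whole_word_matches("The lion's pride", "lion"))
print(whole_word_matches("The lion's pride", "lion's"))
print(whole_word_matches("The lioness sleeps", "lion"))
```

With this sketch both "lion" and "lion's" are found in "The lion's pride",
while "lion" is not found in "lioness". Note that it also matches "can"
inside "can't", which may or may not be wanted. The second suggestion, a
unique rule status for words containing apostrophes, is not shown here.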


-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]