FW: [icu-support] Index by words and apostrophes

Ruslan Zasukhin sunshine at public.kherson.ua
Sat Dec 2 13:20:28 CST 2006



------ Forwarded Message
From: Ruslan Zasukhin <sunshine at public.kherson.ua>
Date: Sat, 02 Dec 2006 11:46:17 +0200
To: ICU support mailing list <icu-support at lists.sourceforge.net>
Conversation: [icu-support] Index by words and apostrophes
Subject: Re: [icu-support] Index by words and apostrophes

On 12/1/06 9:53 PM, "Rick Cameron" <Rick.Cameron at businessobjects.com> wrote:

Hi Andy,
Hi Rick,

Thank you for the answers!

This is bad news, actually. I was sure that with ICU we would not have such
problems...

What I have not grasped so far is how we can resolve this issue:

    * use the Rule-based Word Iterator?

    * use our own algorithms?  ... nightmare...
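
For the "whole word" search case specifically, the second option may be
smaller than it sounds. A minimal sketch in Python, not ICU code: the
function name find_whole_word is hypothetical, and the assumption is that
an apostrophe may simply be treated as a non-word character on either side
of the match.

```python
import re

def find_whole_word(needle, text):
    """Return the start offsets where `needle` occurs as a whole word.

    Assumption: a match counts as "whole" when it is not directly preceded
    or followed by a letter, digit or underscore (Python's Unicode-aware \\w).
    An apostrophe is not in \\w, so it acts as a boundary here.
    """
    pattern = re.compile(
        r"(?<!\w)" + re.escape(needle) + r"(?!\w)",
        re.IGNORECASE,
    )
    return [m.start() for m in pattern.finditer(text)]

print(find_whole_word("lion", "The lion's pride"))
print(find_whole_word("orgueil", "l'orgueil"))
```

The trade-off is that the "s" of "lion's" also becomes findable as a whole
word, which may or may not be acceptable for a given index.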



> The OP also gives an example in English: searching for "lion" as a whole
> word in "The lion's pride" fails, too. I just tried this in MS Word
> 2003, and the search succeeds.
> 
> This could be trickier than "l'orgueil" since in the latter case, the
> "l" really is a separate word, whereas in the former case, the "s" is
> not.
> 
> Perhaps this means that the ICU word break iterator is not a good tool
> to use when doing a search with the "whole word" option selected.
> 
> Cheers
> 
> - rick
> 
> -----Original Message-----
> From: icu-support-bounces at lists.sourceforge.net
> [mailto:icu-support-bounces at lists.sourceforge.net] On Behalf Of Andy
> Heninger
> Sent: Friday, 1 December 2006 11:17
> To: ICU support mailing list
> Cc: valentina at lists.macserve.net; Pierre Rossel
> Subject: Re: [icu-support] Index by words and apostrophes
> 
> On 11/30/06, Ruslan Zasukhin <sunshine at public.kherson.ua> wrote:
>> On 11/30/06 12:45 AM, "Pierre Rossel" <agora07 at prossel.com> wrote:
>> 
>> I have CC'd your question to the ICU list as well, to get the best answer.
>> 
>> What I think is:
>>     Valentina gives you access to 7-9 parameters of the Locale.
>> 
>> I think that if French has some special rules, then with the correct
>> settings ICU will do the correct job.
>> 
>> ----
>> For the ICU list's information: in Valentina we just use
>> WordBreakIterator, which searches for token boundaries according to
>> the current Locale settings.
>> 
> 
> The ICU word break iterator does not have specialized behavior for
> French at this time, but just uses the default boundary conditions
> specified by Unicode UAX 29, meaning that an apostrophe appearing
> between two letters is included in the word.
> 
> There is a pending request for ICU to provide French specific behavior.
> 
> UAX-29 has this to say about apostrophes:
> 
>> The use of the apostrophe is ambiguous. It is usually considered part
>> of one word ("can't" or "aujourd'hui") but it may also be considered
>> as part of two words ("l'objectif"). A further complication is the use
>> of the same character as an apostrophe and as a quotation mark.
>> Therefore leading or trailing apostrophes are best excluded from the
>> default definition of a word. In some languages, such as French and
>> Italian, tailoring to break words when the character after the
>> apostrophe is a vowel may yield better results in more cases. This can
>> be done by adding a rule WB5a.
> 
> http://www.unicode.org/reports/tr29/
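
The tailoring described in the UAX-29 passage above can be approximated as
a post-processing step on tokens that a default word breaker has already
produced. The following Python sketch is only an illustration under stated
assumptions: it is not ICU code and not the actual WB5a rule text, the
function name split_after_apostrophe is hypothetical, and the vowel set is
a rough guess for French and Italian.

```python
# Assumed apostrophe characters: ASCII apostrophe and right single quote.
APOSTROPHES = "'\u2019"
# Assumed vowel set (lowercase); a real tailoring would need a vetted list.
VOWELS = "aeiouyàâäéèêëìíîïòóôöùúûü"

def split_after_apostrophe(token):
    """Split `token` after each apostrophe that is followed by a vowel."""
    parts = []
    start = 0
    for i in range(1, len(token)):
        if token[i - 1] in APOSTROPHES and token[i].lower() in VOWELS:
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts

for word in ["l'orgueil", "l'objectif", "aujourd'hui", "can't", "lion's"]:
    print(word, "->", split_after_apostrophe(word))
```

Under these assumptions "l'orgueil" and "l'objectif" are split after the
apostrophe, while "aujourd'hui", "can't" and "lion's" stay whole, because
the character after the apostrophe is not in the vowel set. So this kind of
tailoring would help the French case but not the English "lion's" case.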

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]

------ End of Forwarded Message



