FW: [icu-support] Index by words and apostrophes
Ruslan Zasukhin
sunshine at public.kherson.ua
Sat Dec 2 13:20:28 CST 2006
------ Forwarded Message
From: Ruslan Zasukhin <sunshine at public.kherson.ua>
Date: Sat, 02 Dec 2006 11:46:17 +0200
To: ICU support mailing list <icu-support at lists.sourceforge.net>
Conversation: [icu-support] Index by words and apostrophes
Subject: Re: [icu-support] Index by words and apostrophes
On 12/1/06 9:53 PM, "Rick Cameron" <Rick.Cameron at businessobjects.com> wrote:
> Hi Andy,
Hi Rick,
Thank you for the answers!
This is actually bad news. I was sure that with ICU we would not have such
problems.
What I have not caught so far: how can we resolve this issue?
* use a rule-based word iterator?
* use our own algorithms? ... a nightmare...
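
To make the "own algorithms" option concrete, here is a minimal sketch in
Python (not ICU code) of a whole-word search that does not rely on a word
break iterator at all. It assumes that a hit counts as a whole word when the
characters directly before and after it are not letters or digits, so an
apostrophe acts as a boundary. The function name find_whole_word is
hypothetical, and the comparison is case-sensitive.

```python
def find_whole_word(text, word):
    """Return the start offsets where `word` occurs as a whole word.

    Assumption of this sketch: a boundary is any position where the
    neighbouring character is missing or is not a letter or digit, so
    apostrophes count as boundaries. This is not ICU behaviour.
    """
    hits = []
    if not word:
        return hits
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        # Check the character just before and just after the candidate hit.
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) or not text[end].isalnum()
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits
```

With this rule "lion" is found in "The lion's pride" and "orgueil" in
"l'orgueil", but "can" is also found in "can't", so it trades one kind of
miss for another kind of false hit.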
> The OP also gives an example in English: searching for "lion" as a whole
> word in "The lion's pride" fails, too. I just tried this in MS Word
> 2003, and the search succeeds.
>
> This could be trickier than "l'orgueil" since in the latter case, the
> "l" really is a separate word, whereas in the former case, the "s" is
> not.
>
> Perhaps this means that the ICU word break iterator is not a good tool
> to use when doing a search with the "whole word" option selected.
>
> Cheers
>
> - rick
>
> -----Original Message-----
> From: icu-support-bounces at lists.sourceforge.net
> [mailto:icu-support-bounces at lists.sourceforge.net] On Behalf Of Andy
> Heninger
> Sent: Friday, 1 December 2006 11:17
> To: ICU support mailing list
> Cc: valentina at lists.macserve.net; Pierre Rossel
> Subject: Re: [icu-support] Index by words and apostrophes
>
> On 11/30/06, Ruslan Zasukhin <sunshine at public.kherson.ua> wrote:
>> On 11/30/06 12:45 AM, "Pierre Rossel" <agora07 at prossel.com> wrote:
>>
>> I have CC'd your question to the ICU list as well, to get the best answer.
>>
>> What I think is:
>> Valentina gives you access to 7-9 Locale parameters.
>>
>> I think that if French has some special rules, then with the correct
>> settings ICU will do the correct job.
>>
>> ----
>> For the ICU list's information: in Valentina we just use a
>> WordBreakIterator, which searches for token boundaries according to the
>> current Locale settings.
>>
>
> The ICU word break iterator does not have specialized behavior for
> French at this time, but just uses the default boundary conditions
> specified by Unicode UAX 29, meaning that an apostrophe appearing
> between two letters is included in the word.
>
> There is a pending request for ICU to provide French specific behavior.
>
> UAX-29 has this to say about apostrophes:
>
>> The use of the apostrophe is ambiguous. It is usually considered part
>> of one word ("can't" or "aujourd'hui") but it may also be considered
>> as part of two words ("l'objectif"). A further complication is the use
>> of the same character as an apostrophe and as a quotation mark.
>> Therefore leading or trailing apostrophes are best excluded from the
>> default definition of a word. In some languages, such as French and
>> Italian, tailoring to break words when the character after the
>> apostrophe is a vowel may yield better results in more cases. This can
>> be done by adding a rule WB5a.
>
> http://www.unicode.org/reports/tr29/
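
The effect of the tailoring described in the UAX 29 passage above can be
sketched in a few lines of Python. This is an illustration only, not ICU code
and not a full UAX 29 implementation: the regular expression, the function
names and the vowel test (base letter in "aeiou" after stripping accents) are
assumptions of the sketch. By default an apostrophe between two letters stays
inside the word; with break_before_vowel=True the word is split after an
apostrophe that is followed by a vowel.

```python
import re
import unicodedata

APOSTROPHES = "'\u2019"

# Runs of letters/digits, joined by a single apostrophe between two runs.
# Leading and trailing apostrophes are left out of the word.
WORD_RE = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")


def is_vowel(ch):
    """Rough vowel test (an assumption): strip accents, compare to aeiou."""
    base = unicodedata.normalize("NFD", ch)[0].lower()
    return base in "aeiou"


def words(text, break_before_vowel=False):
    """Split text into words, optionally breaking after apostrophe + vowel."""
    out = []
    for match in WORD_RE.finditer(text):
        token = match.group()
        if not break_before_vowel:
            out.append(token)
            continue
        start = 0
        for i, ch in enumerate(token):
            # The regex guarantees a letter or digit follows each apostrophe.
            if ch in APOSTROPHES and is_vowel(token[i + 1]):
                out.append(token[start:i + 1])  # apostrophe stays on the left
                start = i + 1
        out.append(token[start:])
    return out
```

In this sketch "l'objectif" becomes "l'" and "objectif" once the tailoring is
switched on, while "can't" and "aujourd'hui" stay whole because "t" and "h"
are not treated as vowels. "lion's" also stays whole, so the English
whole-word search case is not addressed by this kind of tailoring.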
--
Best regards,
Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc
Valentina - Joining Worlds of Information
http://www.paradigmasoft.com
[I feel the need: the need for speed]
------ End of Forwarded Message