FW: [icu-support] Index by words and apostrophes

Ruslan Zasukhin sunshine at public.kherson.ua
Sat Dec 2 13:19:32 CST 2006



------ Forwarded Message
From: Andy Heninger <andy.heninger at gmail.com>
Reply-To: ICU support mailing list <icu-support at lists.sourceforge.net>
Date: Fri, 1 Dec 2006 11:16:36 -0800
To: ICU support mailing list <icu-support at lists.sourceforge.net>
Cc: "valentina at lists.macserve.net" <valentina at lists.macserve.net>, Pierre
Rossel <agora07 at prossel.com>
Subject: Re: [icu-support] Index by words and apostrophes

On 11/30/06, Ruslan Zasukhin <sunshine at public.kherson.ua> wrote:
> On 11/30/06 12:45 AM, "Pierre Rossel" <agora07 at prossel.com> wrote:
>
> I have CC'd your question to the ICU list as well, to get the best answer.
>
> What I think is:
>     Valentina gives you access to 7-9 parameters of Locale.
>
> I think that if French has some special rules, then if you set the correct
> settings, ICU will do the correct job.
>
> ----
> For the ICU list's information: in Valentina we just use WordBreakIterator,
> which searches for token boundaries according to the current Locale settings.
>

The ICU word break iterator does not have specialized behavior for
French at this time, but just uses the default boundary conditions
specified by Unicode UAX 29, meaning that an apostrophe appearing
between two letters is included in the word.

There is a pending request for ICU to provide French-specific behavior.

UAX-29 has this to say about apostrophes:

> The use of the apostrophe is ambiguous. It is usually considered part of
> one word ("can't" or "aujourd'hui") but it may also be considered as part
> of two words ("l'objectif"). A further complication is the use of the same
> character as an apostrophe and as a quotation mark. Therefore leading
> or trailing apostrophes are best excluded from the default definition of
> a word. In some languages, such as French and Italian, tailoring to
> break words when the character after the apostrophe is a vowel may
> yield better results in more cases. This can be done by adding a
> rule WB5a.

http://www.unicode.org/reports/tr29/
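
To make the difference concrete, here is a minimal Python sketch of the two
behaviors described above. It is not ICU code and does not use ICU's rule
syntax: the regular expression is only a rough stand-in for the default
UAX 29 treatment of an apostrophe between letters, the helper names
default_words and tailored_words are hypothetical, and the vowel list is an
assumption chosen for French. The sketch also drops the apostrophe from the
output, which suits indexing but is not what a boundary rule alone would do.

```python
import re

# Rough stand-in for the default behavior: runs of letters, optionally
# joined by a single apostrophe (ASCII or U+2019), form one token.
# This is an approximation for illustration, not ICU's actual rules.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

# Assumed vowel set for the tailoring; adjust per language.
VOWELS = "aeiouàâäéèêëîïôöùûü"


def default_words(text):
    """Tokenize so that an apostrophe between two letters stays in the word."""
    return WORD_RE.findall(text)


def tailored_words(text):
    """Also break at an apostrophe when the next character is a vowel."""
    words = []
    for token in default_words(text):
        start = 0
        for i, ch in enumerate(token):
            # The regex guarantees a letter follows every apostrophe,
            # so token[i + 1] always exists here.
            if ch in "'\u2019" and token[i + 1].lower() in VOWELS:
                words.append(token[start:i])  # apostrophe itself is dropped
                start = i + 1
        words.append(token[start:])
    return words


if __name__ == "__main__":
    print(default_words("L'orgueil du lion"))
    print(tailored_words("L'orgueil du lion"))
```

With this sketch, "orgueil" becomes its own token in "L'orgueil du lion",
while "aujourd'hui" and "can't" stay whole because the character after the
apostrophe is not in the vowel list. Note that "lion's" also stays whole, so
a vowel-based tailoring by itself would not address the English possessive
case from the original question.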

  -- Andy

>
> > Hello,
> >
> > I have noticed that the apostrophe character is not considered as a word
> > separator when a text field is indexed by words.
> >
> > If the text contains a sentence such as "The lion's pride", a search on the
> > exact word "lion" won't match.
> >
> > In French, the same sentence would translate to "L'orgueil du lion". Notice
> > the apostrophe which separates "L" from "orgueil". "L" means "The" in
> > English. A search on "orgueil" won't match the record as "L'orgueil" is
> > considered as one word.
> >
> > I have reported this as a bug in Mantis
> > http://www.valentina-db.com/bt/view.php?id=2008
> >
> > This is a real problem for me as some words cannot be found by my search
> > engine if they are next to an apostrophe.
> >
> > By the way, I have tried to search the word "orgueil" in the text "L'orgueil
> > du lion" with several word processing applications and text editors, with
> > the option "whole words only". They all found it, so they all consider the
> > apostrophe as a word separator. Why should Valentina behave otherwise?
> >
> > What do other developers think?
> > Is the apostrophe part of a word or not?
>


------ End of Forwarded Message



