FW: [icu-support] Index by words and apostrophes
Ruslan Zasukhin
sunshine at public.kherson.ua
Sat Dec 2 13:20:15 CST 2006
------ Forwarded Message
From: Rick Cameron <Rick.Cameron at businessobjects.com>
Reply-To: ICU support mailing list <icu-support at lists.sourceforge.net>
Date: Fri, 1 Dec 2006 11:53:03 -0800
To: ICU support mailing list <icu-support at lists.sourceforge.net>
Cc: "valentina at lists.macserve.net" <valentina at lists.macserve.net>, Pierre
Rossel <agora07 at prossel.com>
Conversation: [icu-support] Index by words and apostrophes
Subject: Re: [icu-support] Index by words and apostrophes
The OP also gives an example in English: searching for "lion" as a whole
word in "The lion's pride" fails, too. I just tried this in MS Word
2003, and the search succeeds.
This could be trickier than "l'orgueil" since in the latter case, the
"l" really is a separate word, whereas in the former case, the "s" is
not.
Perhaps this means that the ICU word break iterator is not a good tool
to use when doing a search with the "whole word" option selected.
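
For comparison, here is a minimal sketch of the kind of "whole word" test that editors appear to apply, assuming a word is bounded by anything that is not a letter, digit or underscore. The function name find_whole_word is hypothetical, and this is not a claim about how MS Word or Valentina actually implement the option.

```python
import re

def find_whole_word(text, word):
    """Return the start offset of `word` as a whole word in `text`, or -1."""
    # \w matches letters, digits and underscore but not the apostrophe,
    # so an apostrophe on either side counts as a word boundary here.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.start() if match else -1

print(find_whole_word("The lion's pride", "lion"))      # 4
print(find_whole_word("L'orgueil du lion", "orgueil"))  # 2
print(find_whole_word("The lioness", "lion"))           # -1
```

The trade-off is that "can" would also be found as a whole word in "can't", which is presumably why the default word boundary rules keep such apostrophes inside the word.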
Cheers
- rick
-----Original Message-----
From: icu-support-bounces at lists.sourceforge.net
[mailto:icu-support-bounces at lists.sourceforge.net] On Behalf Of Andy
Heninger
Sent: Friday, 1 December 2006 11:17
To: ICU support mailing list
Cc: valentina at lists.macserve.net; Pierre Rossel
Subject: Re: [icu-support] Index by words and apostrophes
On 11/30/06, Ruslan Zasukhin <sunshine at public.kherson.ua> wrote:
> On 11/30/06 12:45 AM, "Pierre Rossel" <agora07 at prossel.com> wrote:
>
> I have CC'd your question to the ICU list as well, to get the best answer.
>
> What I think is:
> Valentina gives you access to 7-9 parameters of the Locale.
>
> I think that if French has some special rules, then once you choose the
> correct settings, ICU will do the correct job.
>
> ----
> For the information of the ICU list: in Valentina we just use
> WordBreakIterator, which searches for token boundaries according to the
> current Locale settings.
>
The ICU word break iterator does not have specialized behavior for
French at this time, but just uses the default boundary conditions
specified by Unicode UAX 29, meaning that an apostrophe appearing
between two letters is included in the word.
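
To make the default behaviour concrete, here is a rough sketch in Python. It is not ICU and not the full UAX 29 algorithm; it only imitates the one rule discussed here (an apostrophe between two letters stays inside the word). The names WORD and default_words are made up for this illustration.

```python
import re

# Simplified stand-in for the default rule: a run of letters, optionally
# continued by an apostrophe that is immediately followed by more letters.
# Assumption: only ' and U+2019 are treated as apostrophes; digits and the
# other UAX 29 character classes are ignored.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

def default_words(text):
    """Return word tokens, keeping letter-apostrophe-letter together."""
    return WORD.findall(text)

print(default_words("The lion's pride"))   # ['The', "lion's", 'pride']
print(default_words("L'orgueil du lion"))  # ["L'orgueil", 'du', 'lion']
```

With tokens like these in a word index, an exact lookup of "lion" or "orgueil" finds nothing in the first positions, which matches what was reported.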
There is a pending request for ICU to provide French-specific behavior.
UAX-29 has this to say about apostrophes:
> The use of the apostrophe is ambiguous. It is usually considered part
> of one word ("can't" or "aujourd'hui") but it may also be considered
> as part of two words ("l'objectif"). A further complication is the use
> of the same character as an apostrophe and as a quotation mark.
> Therefore leading or trailing apostrophes are best excluded from the
> default definition of a word. In some languages, such as French and
> Italian, tailoring to break words when the character after the
> apostrophe is a vowel may yield better results in more cases. This can
> be done by adding a rule WB5a.
http://www.unicode.org/reports/tr29/
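
The tailoring described in that passage could be sketched roughly as follows. This is only an illustration under stated assumptions, not ICU code and not an exact rendering of rule WB5a: the vowel set is a guess, and WORD, VOWELS, APOSTROPHES and tailored_words are hypothetical names.

```python
import re

# Same simplified default rule: letters, with apostrophes allowed only
# between letters. Not the full UAX 29 algorithm.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")
APOSTROPHES = {"'", "\u2019"}
# Assumed vowel set for illustration; a real tailoring would define its own.
VOWELS = set("aeiouyàâäéèêëîïôöùûü")

def tailored_words(text):
    """Split default tokens after an apostrophe that precedes a vowel."""
    words = []
    for token in WORD.findall(text):
        start = 0
        for i, ch in enumerate(token):
            nxt = token[i + 1] if i + 1 < len(token) else ""
            if ch in APOSTROPHES and nxt.lower() in VOWELS:
                # The break falls after the apostrophe, so it stays
                # attached to the elided article ("L'").
                words.append(token[start:i + 1])
                start = i + 1
        words.append(token[start:])
    return words

print(tailored_words("L'orgueil du lion"))  # ["L'", 'orgueil', 'du', 'lion']
print(tailored_words("aujourd'hui"))        # ["aujourd'hui"]
print(tailored_words("The lion's pride"))   # ['The', "lion's", 'pride']
```

Note that such a tailoring would make "orgueil" findable but leaves "lion's" as a single token, since "s" is not a vowel.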
-- Andy
>
> > Hello,
> >
> > I have noticed that the apostrophe character is not considered as a
> > word separator when a text field is indexed by words.
> >
> > If the text contains a sentence such as "The lion's pride", a search
> > on the exact word "lion" won't match.
> >
> > In French, the same sentence would translate to "L'orgueil du lion".
> > Notice the apostrophe which separates "L" from "orgueil". "L" means
> > "The" in English. A search on "orgueil" won't match the record as
> > "L'orgueil" is considered as one word.
> >
> > I have reported this as a bug in Mantis
> > http://www.valentina-db.com/bt/view.php?id=2008
> >
> > This is a real problem for me as some words cannot be found by my
> > search engine if they are next to an apostrophe.
> >
> > By the way, I have tried to search the word "orgueil" in the text
> > "L'orgueil du lion" with several word processing applications and
> > text editors, with the option "whole words only". They all found
> > it, so they all consider the apostrophe as a word separator. Why
> > should Valentina behave otherwise?
> >
> > What do other developers think?
> > Is the apostrophe part of a word or not?
>
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
icu-support mailing list - icu-support at lists.sourceforge.net
To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support
------ End of Forwarded Message