IndexStyle
Erik Mueller-Harder
valentina-list at vermontsoftworks.com
Wed Nov 3 09:33:04 CST 2004
On Nov 2, 2004, at 19:33, Ruslan Zasukhin wrote:
> Today also I have found that right now it split on words not good.
> It not consider punctuation.
>
> I will add this, this is easy.
>
> ** But this bring the issue which we have discuss before.
> Ability for developer define own set of "punctuation chars"
> or "breakers"
>
> So I think IndexStyle should be improved for this.
>
> indStle.SetBreakers( "./*-+?:;'"|\\<>%^#@!±§+" )
Yes. Being able to set individual breaking characters would be great.
> ** also working with Jon project I have to see that into index comes
> such words as
>
> "a" "the"
>
> ** Also comes numbers as 0455477 1214
>
> -- I think may be Valentina should on default use style with length
> limit at least 2 or 3 or may be even 4? Or better put this on developer ?
I'd be inclined to want everything indexed unless I as a developer
explicitly overrode that behavior. But I suppose that having a default
length limit of 2 or maybe 3 would be OK -- as long as it was very
clearly documented and overridable.
> -- May be we can to have also
> style.DisableNumbers = true ?
>
> does this have sense?
> may be better to have some other more general way ?
I'd prefer a more general solution.
It seems to me that it might be handy to be able to use Unicode
"categories" as convenient shortcuts in the discussions of both
breaking characters and word indexing.
The main Unicode categories are letters, numbers, punctuation, symbols,
marks, separators, and "other" (control characters, unassigned code
points, and so on). Subcategories exist for upper- & lower-case letters,
different types of punctuation, etc. And there are separate character
"properties" such as whitespace, quotation marks, alphabetic, etc. All
the work has already been done to assign a category and (optionally)
properties to each Unicode code point.
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
<http://www.unicode.org/Public/UNIDATA/UCD.html#White_Space> (etc.)
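To make the idea of categories concrete, here is a minimal sketch in Python (not Valentina code) that looks up the General_Category of a few characters with the standard `unicodedata` module. The mapping of the first letter of each two-letter code to a readable name, and the name `major_category`, are my own assumptions for illustration.

```python
import unicodedata

# Hypothetical readable names for the seven major classes, keyed by the
# first letter of the two-letter General_Category code ("Lu", "Nd", "Po", ...).
MAJOR_CATEGORIES = {
    "L": "letter",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "M": "mark",
    "Z": "separator",
    "C": "other",
}


def major_category(ch):
    """Return the major Unicode category name for a single character."""
    return MAJOR_CATEGORIES[unicodedata.category(ch)[0]]


# Print the detailed code and the major class for a few sample characters.
for ch in "a5.+ \t":
    print(repr(ch), unicodedata.category(ch), major_category(ch))
```

Note that a tab is a control character ("other") rather than a separator, which is one reason the whitespace property is worth having alongside the categories.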
Developers could easily define their break-character categories (e.g.,
separators, punctuation, and whitespace) and then define what resulting
"words" they want indexed (e.g., any "word" whose characters are all
letters, every "word" that doesn't include a number, or only "words"
made up entirely of uppercase letters).
Perhaps methods such as:
    style.BreakCategory("separator") = True    // All separators and punctuation
    style.BreakCategory("punctuation") = True  // are now break characters,
    style.BreakCategory("number") = False      // but numbers are not.
could work for defining break-character categories. Defining good
Valentina defaults would keep us from having to use these too much, of
course, and we'd probably still need to be able to override specific
items in the break-character categories -- perhaps with your
"SetBreakers" method, above. Still, the ability to define large groups
of characters quickly, without having to worry about whether or not
you've left out a character or two, would be quite worthwhile, I think.
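As a rough illustration of how category-based breaking with per-character overrides might behave, here is a small Python sketch. It is not Valentina's implementation; `split_words`, `extra_breakers`, and `non_breakers` are hypothetical names, with the last two standing in for a "SetBreakers"-style override.

```python
import unicodedata

# Hypothetical names for the major classes (first letter of General_Category).
MAJOR = {"L": "letter", "N": "number", "P": "punctuation", "S": "symbol",
         "M": "mark", "Z": "separator", "C": "other"}


def is_breaker(ch, break_categories, extra_breakers="", non_breakers=""):
    """Decide whether ch ends a word. Explicit overrides win over categories."""
    if ch in non_breakers:
        return False
    if ch in extra_breakers:
        return True
    return MAJOR[unicodedata.category(ch)[0]] in break_categories


def split_words(text, break_categories, extra_breakers="", non_breakers=""):
    """Split text into words at every break character, dropping empty words."""
    words, current = [], []
    for ch in text:
        if is_breaker(ch, break_categories, extra_breakers, non_breakers):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


breaks = {"separator", "punctuation"}
print(split_words("R2D2, apple; 3M/0455477", breaks))
# Keep the hyphen inside words even though it is punctuation.
print(split_words("well-known fact", breaks, non_breakers="-"))
```

Tabs and newlines fall in the "other" category, so a real default would probably need the whitespace property as well as the separator category.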
The question of "word" indexing (e.g., "apple", "0455477") is more
difficult, it seems to me, because some developers might need to index,
for example, all "words" that do not contain any digits, whereas others
might need to index those that contain digits but are not exclusively
numeric:
apple
0455477
3M
R2D2
Perhaps something like:
    style.IndexCategory("number") = True                 // 0455477, 3M, R2D2 are indexed
    style.IndexCategory("number", valentina.all) = True  // 0455477 is indexed
    style.IndexCategory("number", valentina.any) = True  // 3M, R2D2 are indexed
Hmm. That needs some further work, I think.
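One way to pin down the distinction is to treat "any", "all", "mixed", and "none" as separate tests on a word's characters. The Python sketch below does that for the four sample words; the mode names and the function `number_match` are assumptions of mine, not a proposed Valentina API.

```python
import unicodedata


def is_number(ch):
    """True if ch belongs to one of the Unicode number categories (Nd, Nl, No)."""
    return unicodedata.category(ch).startswith("N")


def number_match(word, mode):
    """Test how the number category applies to word.

    any   - at least one character is a number
    all   - every character is a number
    mixed - some characters are numbers, but not all of them
    none  - no character is a number
    """
    flags = [is_number(ch) for ch in word]
    if mode == "any":
        return any(flags)
    if mode == "all":
        return bool(flags) and all(flags)
    if mode == "mixed":
        return any(flags) and not all(flags)
    if mode == "none":
        return not any(flags)
    raise ValueError("unknown mode: " + mode)


words = ["apple", "0455477", "3M", "R2D2"]
for mode in ("any", "all", "mixed", "none"):
    print(mode, [w for w in words if number_match(w, mode)])
```

With the tests separated like this, "index words that contain digits but are not exclusively numeric" is simply the "mixed" case, and "no digits at all" is the "none" case.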
Anyway, my point is that this kind of general solution will end up
giving the developer much more flexibility. Using broad categories, we
can save keystrokes and avoid leaving out specific code points. With
useful defaults, we shouldn't have to spend too much time thinking
about all this, except in special cases.
Thoughts?
-- Erik