IndexStyle

Wed Nov 3 09:33:04 CST 2004

On Nov 2, 2004, at 19:33, Ruslan Zasukhin wrote:

> Today also I have found that right now it split on words not good.
> It not consider punctuation.
>
> I will add this, this is easy.
>
> ** But this bring the issue which we have discuss before.
> Ability for developer define own set of "punctuation chars"
>     or "breakers"
>
> So I think IndexStyle should be improved for this.
>
>     indStle.SetBreakers( "./*-+?:;'"|\\<>%^#@!±§+" )

Yes.  Being able to set individual breaking characters would be great.

> ** also working with Jon project I have to see that into index comes such
> words as
>
>     "a" "the"
>
> ** Also comes numbers as 0455477 1214
>
> -- I think may be Valentina should on default use style with length limit at
> least 2 or 3 or may be even 4?  Or better put this on developer ?

I'd be inclined to want everything indexed unless I as a developer 
explicitly overrode that behavior.  But I suppose that having a default 
length limit of 2 or maybe 3 would be OK -- as long as it was very 
clearly documented and overrideable.
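To make the "index everything unless overridden" default concrete, here is a minimal Python sketch. The function name and the min_length parameter are purely hypothetical and are not part of any Valentina API.

```python
def indexable_words(words, min_length=1):
    """Keep words with at least min_length characters.

    The default of 1 keeps every word, so nothing is dropped unless the
    developer explicitly asks for a limit (hypothetical behavior).
    """
    return [w for w in words if len(w) >= min_length]


words = ["a", "the", "of", "index", "1214"]
print(indexable_words(words))                # default: everything is kept
print(indexable_words(words, min_length=3))  # explicit override
```

Note that a limit of 3 still lets "the" through, so a length limit alone would not remove every common short word.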

> -- May be we can to have also
>         style.DisableNumbers = true     ?
>
>     does this have sense?
>     may be better to have some other more general way ?

I'd prefer a more general solution.

It seems to me that it might be handy to be able to use Unicode 
"categories" as convenient shortcuts in the discussions of both 
breaking characters and word indexing.

The main Unicode categories are letters, numbers, punctuation, symbols, 
marks, separators, and miscellaneous.  Subcategories exist for upper- & 
lower-case letters, different types of punctuation, etc.  And there are 
other category "properties" such as whitespace, quotation marks, 
alphabetic, etc.  All the work has already been done to assign a 
category and (optionally) properties to each Unicode code point.

<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
<http://www.unicode.org/Public/UNIDATA/UCD.html#White_Space> (etc.)
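As an illustration of how little work is needed to get at these categories, here is a small sketch using Python's standard unicodedata module. The main_category helper and its name table are my own shorthand, not anything from Valentina; the table calls the last group "other", which is the group referred to above as miscellaneous.

```python
import unicodedata

# First letter of a General_Category code -> main category name.
MAIN_CATEGORY = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",
}


def main_category(ch):
    """Return the main Unicode category name for a single character."""
    return MAIN_CATEGORY[unicodedata.category(ch)[0]]


for ch in ["a", "Z", "7", ";", "+", " ", "\t"]:
    # Prints the character, its two-letter subcategory code, and the main category.
    print(repr(ch), unicodedata.category(ch), main_category(ch))
```

The tab character shows why properties matter alongside categories: its category is a control code rather than a separator, yet it counts as whitespace (Python's str.isspace() returns True for it).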

Developers could easily define their break-character categories (e.g., 
separators, punctuation, and whitespace) and then define what resulting 
"words" they want indexed (e.g., any "word" whose characters are all 
letters, every "word" that doesn't include a number, or only "words" 
made up of only uppercase letters).

Perhaps methods such as:

	style.BreakCategory("separator") = True    // All separators and punctuation
	style.BreakCategory("punctuation") = True  // are now break characters,
	style.BreakCategory("number") = False      // but numbers are not.

could work for defining break-character categories.  Defining good 
Valentina defaults would keep us from having to use these too much, of 
course, and we'd probably still need to be able to override specific 
items in the break-character categories -- perhaps with your 
"SetBreakers" method, above.  Still, the ability to define large groups 
of characters quickly, without having to worry about whether or not 
you've left out a character or two, would be quite worthwhile, I think.
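A rough Python sketch of the idea, combining category-based breakers with per-character overrides. The IndexStyle class here is a hypothetical stand-in written only for illustration; its attribute names are assumptions and do not reflect the real Valentina IndexStyle.

```python
import unicodedata


class IndexStyle:
    """Hypothetical stand-in for the style object discussed in this thread."""

    def __init__(self):
        self.break_categories = set()  # first letters of category codes, e.g. "P"
        self.break_whitespace = False  # treat White_Space characters as breakers
        self.extra_breakers = set()    # individual characters that always break
        self.non_breakers = set()      # individual characters that never break

    def is_breaker(self, ch):
        # Per-character overrides win over the broad categories.
        if ch in self.non_breakers:
            return False
        if ch in self.extra_breakers:
            return True
        if self.break_whitespace and ch.isspace():
            return True
        return unicodedata.category(ch)[0] in self.break_categories

    def split(self, text):
        """Split text into words at every breaker character."""
        words, current = [], []
        for ch in text:
            if self.is_breaker(ch):
                if current:
                    words.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            words.append("".join(current))
        return words


style = IndexStyle()
style.break_categories = {"Z", "P"}  # separators and punctuation
style.break_whitespace = True
style.non_breakers = {"-"}           # override: keep hyphenated words whole
print(style.split("R2-D2, meet 3M; call 0455477.\tDone"))
```

The override set is what a SetBreakers-style call would feed: the categories cover the bulk of the characters, and the explicit sets handle the exceptions.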

The question of "word" indexing (e.g., "apple", "0455477") is more 
difficult, it seems to me, because some developers might need to index, 
for example, all "words" that do not contain any digits, whereas others 
might need to index those that contain digits but are not exclusively 
numeric:

	apple
	0455477
	3M
	R2D2

Perhaps something like:

	style.IndexCategory("number") = True                 // 0455477, 3M, R2D2 are indexed
	style.IndexCategory("number", valentina.all) = True  // 0455477 is indexed
	style.IndexCategory("number", valentina.any) = True  // 3M, R2D2 are indexed

Hmm.  That needs some further work, I think.
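Part of what needs work is the meaning of "any". A small Python sketch makes the ambiguity visible. The mode names below ("all", "any", "mixed", "none") are hypothetical labels of my own, not proposed API.

```python
import unicodedata


def is_number(ch):
    """True if the character's Unicode category is one of the number categories."""
    return unicodedata.category(ch).startswith("N")


def matches(word, mode):
    """Test a word against a number rule (mode names are illustrative only)."""
    flags = [is_number(ch) for ch in word]
    if mode == "all":    # every character is a number
        return bool(flags) and all(flags)
    if mode == "any":    # at least one character is a number
        return any(flags)
    if mode == "mixed":  # contains numbers but is not exclusively numeric
        return any(flags) and not all(flags)
    if mode == "none":   # contains no numbers at all
        return not any(flags)
    raise ValueError("unknown mode: " + mode)


words = ["apple", "0455477", "3M", "R2D2"]
for mode in ["all", "any", "mixed", "none"]:
    print(mode, [w for w in words if matches(w, mode)])
```

Read literally, "any" also matches 0455477, since a word made entirely of digits certainly contains a digit. Selecting only 3M and R2D2 takes a separate "mixed" rule, or an "any" rule combined with a negated "all" rule.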

Anyway, my point is that this kind of general solution will end up 
giving the developer much more flexibility.  Using broad categories, we 
can save keystrokes and avoid leaving out specific code points.  With 
useful defaults, we shouldn't have to spend too much time thinking 
about all this, except in special cases.

Thoughts?

-- Erik