Sorting out and mixing new together

Thorsten Hohage thohage at objectmanufactur.com
Thu Apr 17 01:34:06 CDT 2008


Hi everbody,

I've got a really tough task here on my table and I asked myself what  
might be the best strategy for doing it.

So what we're talking about:

I've had downloaded some table from a big system, need to clean up /  
change / maintain one big textfield and then update the big system.

The table have about 120000 records and the max field length is 800+  
chars. This field contains one or more phrases and a phrase contains  
one or more words.

Some years ago two very bad things happened and was not obvious to the  
customer.

1. They forget about the difference of a word (space) delimiter and a  
phrase delimiter (comma) - now it's impossible to determine if a comma  
is used for word or phrase delimiter.

2. Far more worse, they start editing the field manually :-0 - so it  
contains not only phrases, but manually entered text


Now they compile (based on some old tabes) a new list of phrases that  
should be used to fill the field, this table is about 1200 records.


My job is now to identify as much phrases as possible in each field/ 
row and for the first step build a cross-link-table between the record  
and the used phrases from the phrase list. But I want to avoid to  
build plain straight forward stored procedures and get ways to long  
execution time.

What make this job so special is of course, in the new phrases list  
I've a phrase like

	aa, ab, ac, ad

in the table there may be contents of the field like

	aa, ab, ac, ad
	aa, ab ac ad
	aa ab ac ad

or some other funny combinations

	aa-ab-ac-ad
	aa/ab/ac/ad


due to the manual editing thing I even want to identify the phrase if

	a) al for words are in the field, but not the whole phrase
	b) if there is at least a phrase with any of 3 matched to the given  
phrase above e.g. "ad, ab, aa", "aa, ac, ad"
	c) if there is no other phrase in the list containing a "aa"


After all this cleaning there should be only the "garbage" content in  
the field that must be cleared by hand.


So I need to step over the phrase list 1200 - records

	1. search for the full phrase - if found replace
	2. search for a stripped phrase (NO delimeters only one space) on a  
stripped field - replace might be difficult

	3. build all other combinations and do a search / replace for each

I know Valentina is amazingly fast, but even then this task might run  
for many hours.


Any ideas on how to speed up the task are wecome - TIA!


regards

Thorsten Hohage
--
objectmanufactur.com - Hamburg,Germany




More information about the Valentina mailing list