Sorting out and mixing new together
Thorsten Hohage
thohage at objectmanufactur.com
Thu Apr 17 01:34:06 CDT 2008
Hi everbody,
I've got a really tough task here on my table and I asked myself what
might be the best strategy for doing it.
So what we're talking about:
I've had downloaded some table from a big system, need to clean up /
change / maintain one big textfield and then update the big system.
The table have about 120000 records and the max field length is 800+
chars. This field contains one or more phrases and a phrase contains
one or more words.
Some years ago two very bad things happened and was not obvious to the
customer.
1. They forget about the difference of a word (space) delimiter and a
phrase delimiter (comma) - now it's impossible to determine if a comma
is used for word or phrase delimiter.
2. Far more worse, they start editing the field manually :-0 - so it
contains not only phrases, but manually entered text
Now they compile (based on some old tabes) a new list of phrases that
should be used to fill the field, this table is about 1200 records.
My job is now to identify as much phrases as possible in each field/
row and for the first step build a cross-link-table between the record
and the used phrases from the phrase list. But I want to avoid to
build plain straight forward stored procedures and get ways to long
execution time.
What make this job so special is of course, in the new phrases list
I've a phrase like
aa, ab, ac, ad
in the table there may be contents of the field like
aa, ab, ac, ad
aa, ab ac ad
aa ab ac ad
or some other funny combinations
aa-ab-ac-ad
aa/ab/ac/ad
due to the manual editing thing I even want to identify the phrase if
a) al for words are in the field, but not the whole phrase
b) if there is at least a phrase with any of 3 matched to the given
phrase above e.g. "ad, ab, aa", "aa, ac, ad"
c) if there is no other phrase in the list containing a "aa"
After all this cleaning there should be only the "garbage" content in
the field that must be cleared by hand.
So I need to step over the phrase list 1200 - records
1. search for the full phrase - if found replace
2. search for a stripped phrase (NO delimeters only one space) on a
stripped field - replace might be difficult
3. build all other combinations and do a search / replace for each
I know Valentina is amazingly fast, but even then this task might run
for many hours.
Any ideas on how to speed up the task are wecome - TIA!
regards
Thorsten Hohage
--
objectmanufactur.com - Hamburg,Germany
More information about the Valentina
mailing list