Link two db

Sat Mar 15 19:34:48 CST 2003

on 3/15/03 5:31 PM, Tonio Virgilio LEVRA at levra at yahoo.com wrote:

Hi Tonio,

> maybe my answer was not clear.
> I don't want to you use lingo code to build inverted
> index every time the end user ask for a query. I want
> to build inverted index only one time when I create
> the db.

Okay this change things.

> If I don't do in this way I will need to do a like
> select every time using regular expression in many
> field every time. With the inverted index loaded in
> the db I will query with a "=" select on field of one
> word only, and already sorted.

Yes I see, you want pre-build some database for fast usage later.
This have sense of course.

> For sure I do not understand Valentina deeply,
> otherwise I was not here in asking stupid question,
> but I'm afraid that my needs also was not completely
> understood for sure because not explained enough
> clearly.
> 
> Let's start again, if you like:
> 
> 1. I have one huge flat file of 16 fields collecting
> of the article of a newspaper in one year.
> 
> 2. This flat flat has 1 big text field ("the
> article"), another big text field ("the title") and 8
> describing field ("the categories") containing more
> than one word key each.

Why title is text field? I think it can be VarChar[504]
So I have count 1 + 1 + 8 = 10 fields.

> 3. I have separate lists of this word key category and
> the biggest on this has 800 different possibile word
> keys organized in a key structure.

Okay, so Category N have SORTED list of possible key words.
And biggest list contains ONLY 800 items.
Good.

> 4. I need to build a new version of an end user
> interface (the old one was made with a IBM stairs-like
> information retreival application) in a reasonable way
> not to slow compared with the old one.

I prefer target -- be much faster of old way :-)

> 5. The end user must have the opportunity to search
> info:
> - by words of "the article" and/or
> - by words of "the title" and/or

This is easy in Valentina. Correct?

> - by flagging one or more words for each of the 8
> categories
> - by defining the maximum distance between words
> searched inside the article and/or title (if he
> specifies more than one word)

Last position is not implemented in Valentina, so you need self do it.

> 6. The list of records found must fill the list of
> word key of the categories in the "famous way"
> word1[number], etc.
> 
> 7. The records found need to be sorted in many
> different way (some easy: by date, by title, etc.
> others more difficult for me, by the number of the
> instances of the word searched for every record)
> 
> 8. I have to load this flat file in the best way to
> make faster the queries.

So db will be stable and will not be modified future?

> 9. I've started in writing an algorithm able to load
> the flat file into a Valentina db and coding the
> alphabetic word key of the categories in number.

One question btw, why you do this in Director, but not in e.g. REALbasic?
RB should be faster of Director. Much faster for such tasks.

> 10. Then I've started in querying the big text field,
> the article and/or the little text field of the title.
> There query were like select with regular expression
> in it, because I don't think thera are other way, but
> there are tell me please.

No, you need just set for these 2 fields flag fIndexbyWords.
And Valentina will self build index of words for them.
This way is faster of LIKE.

But you have mention that you also want SORT on title,
Then for title you need also define BaseObject method
    TitleForSort = "Title"

And it must NOT have flag IndexedByWords.
Then you will be able do

    WHERE Title = 'Brother'
    ORDER BY TitleForSort

> 11. The time necessary to obtain the result for these
> quesry was very good even if longer than the older
> version of program (the IBM stairs-like one), but
> maybe acceptable.
> 
> 12. My first problem rised up here where the time
> necessary to fill the list of word key of the
> categories in the "famous way" became unacceptable
> (from here my first question "filter on/include-omit
> function").
> And to this time was necessary to ad the time to sort
> the the diffcult way mentionaed before...

And this point I understand hard.
So you need do this operation only once?
You must not do it at runtime?

* What you mean with "famous way"???

* Again, what you mean here. You import your BIG file,
And you want automatically EXTRACT all words mentioned for some category?
But this must be quite easy:

    -- you have Table_c1 with one unique field 'Word'
    -- you simply try add each words from imported file
        and Valentina will simply reject all not unique fields.

After import you get Table_c1 which keep list of unique words for this
category. As you have say biggest such list will have only 800 records.

Now you can SORT this table and EXPORT it if you want get list.
You can even then IMPORT sorted words in table to get faster work...

> 13. So I've started my research on different approach,
> asking to you and exploring the web, etc...
> 
> 14. I don't want to discuss about to use or not to use
> Valentina, because I still want to use Valentina for
> many reasons, and its speed is one of these reasons,
> but maybe not to query text word from huge fields, but
> to query the inverted index.

Do you want a TIP?  :-)

Avoid to use TEXT and prefer to use NUMBERS.
Bellow I will give you one idea for your task.

> 15. When I talk about inverted indexes, I'm talking of
> inverted indexes builded one time only during the
> loading of the database and then loaded in the db in
> separate table. 
> So even if the time to build them in lingo or in other
> way will take 3 days or more it is not a problem
> because it will be done ONE TIME ONLY.
> 
> 16. So I will use valentina to query the inverted
> indexes and maybe with the valentina once again to
> "merge" them and to extract the final list of record
> with the fomous temp table I've supposed....

Okay, no listen my solution for this task as I have understand it.

1) you create Database with Tables

        Article
        {
            Title as VarChar[504] indexByWords
            TextOfArticle as TEXT
            C1 as VarChar[504] index by words    -- category
            ...
            C8

            ----- and helper BaseObject methods:
            TitleFoSort as VarChar[504] Indexed
        }

        T_C1 
        {
            Word as Unique VarChar[504]
        }

        and so on. 

2) Now you start import your BIG file.
During import you simply try add of a word from category N into table T_CN.

After first such import you get
    -- Table Article
    -- Tables C1 ..C8

Note that fields C1..C8 in Table Article are not filled yet.

3) now you need second iteration / import of your big file.

So you take record1, take its category 1, and for each word of that category
find RecID of that word in table T_C1. So you can build string as for
example:

    "sport, sky, bialton"  =>  "34, 75, 198"

And now you store this string of numbers splinted by COMAs into field C1
first record of Table Article.

Okay I hope you have understand idea.

4) Now you have database ready for use.
Assume user want find "sport". You at first jump into table C1

        SELECT RecID FROM C1 WHERE word = 'sport'

Get from this query 34,
Convert it to string "34"

And you now build SQL string as

    SELECT ... FROM Article WHERE C1 = "34"

You can build now any complexity query

    SELECT ... FROM Article
    WHERE (C1 = "34" or C1 = "198") and (C2 = ...)
            and (title = 'brother')
    ORDER BY TitleforSort.

May be here possible more optimizations...just first ideas...

-- 
Best regards,
Ruslan Zasukhin      [ I feel the need...the need for speed ]
-------------------------------------------------------------
e-mail: ruslan at paradigmasoft.com
web: http://www.paradigmasoft.com

To subscribe to the Valentina mail list go to:
http://listserv.macserve.net/mailman/listinfo/valentina
-------------------------------------------------------------