Speeding up finding duplicates

Beatrix Willius bwillius at gmx.de
Sat Apr 23 00:22:41 CDT 2011


I'm trying to speed up finding duplicate entries in my database.

The database itself has a simple structure: Mailboxes - Mails - Attachments. I want to find duplicate mails. Each mail has a checksum, which identifies it.

The solution that was too slow was a view:

create or replace view CheckDuplicates as (select CheckSum, count(CheckSum) as CountOfDup from Message group by CheckSum)

Then I added a condition to the normal search:

  Message.RecID in (select RecID from Message inner join CheckDuplicates on CheckDuplicates.CheckSum = Message.CheckSum where CheckDuplicates.CountOfDup > 1) and Mailbox.MailboxPath <> 'Trash'
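To make the pattern concrete, here is a small sketch of the view plus the IN-subquery condition, using SQLite via Python rather than Valentina (the SQL dialect and the sample data are my assumptions; table and column names follow the post):

```python
import sqlite3

# Sketch of the duplicate-detection pattern in SQLite (Valentina's SQL
# dialect may differ). Sample checksums are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Message (RecID INTEGER PRIMARY KEY, CheckSum INTEGER);
    INSERT INTO Message (CheckSum) VALUES (111), (222), (111), (333), (222);

    CREATE VIEW CheckDuplicates AS
        SELECT CheckSum, COUNT(CheckSum) AS CountOfDup
        FROM Message
        GROUP BY CheckSum;
""")

# Equivalent of the IN (...) condition from the search query.
dup_ids = [row[0] for row in con.execute("""
    SELECT Message.RecID
    FROM Message
    INNER JOIN CheckDuplicates
        ON CheckDuplicates.CheckSum = Message.CheckSum
    WHERE CheckDuplicates.CountOfDup > 1
    ORDER BY Message.RecID
""")]
print(dup_ids)  # -> [1, 2, 3, 5]  (checksums 111 and 222 each occur twice)
```

Because the view has no WHERE clause of its own, every search re-runs the full GROUP BY over Message, which is presumably where the time goes on larger databases.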

This takes way too long for larger databases.

The idea I have now is to calculate duplicates after adding mails to the database:

create table if not exists FindDuplicates1 (CheckSum Long, CountOfDups Integer)
delete from FindDuplicates1
create table if not exists FindDuplicates2 (MessageRecID Long)
delete from FindDuplicates2
insert into FindDuplicates1 select CheckSum, count(CheckSum) as CountOfDups from Message group by CheckSum having count(CheckSum) > 1
insert into FindDuplicates2 select Message.RecID from Message inner join FindDuplicates1 on FindDuplicates1.CheckSum = Message.CheckSum
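The precomputation step can be sketched the same way. Again this uses SQLite via Python as a stand-in for Valentina (Long is mapped to INTEGER, and the sample data is invented), just to show that FindDuplicates2 ends up holding exactly the RecIDs of duplicated mails:

```python
import sqlite3

# Sketch of the two-table precomputation from the post, in SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Message (RecID INTEGER PRIMARY KEY, CheckSum INTEGER);
    INSERT INTO Message (CheckSum) VALUES (111), (222), (111), (333), (222);

    CREATE TABLE IF NOT EXISTS FindDuplicates1 (CheckSum INTEGER, CountOfDups INTEGER);
    DELETE FROM FindDuplicates1;
    CREATE TABLE IF NOT EXISTS FindDuplicates2 (MessageRecID INTEGER);
    DELETE FROM FindDuplicates2;

    -- Checksums that occur more than once.
    INSERT INTO FindDuplicates1
        SELECT CheckSum, COUNT(CheckSum)
        FROM Message
        GROUP BY CheckSum
        HAVING COUNT(CheckSum) > 1;

    -- RecIDs of all mails carrying one of those checksums.
    INSERT INTO FindDuplicates2
        SELECT Message.RecID
        FROM Message
        INNER JOIN FindDuplicates1
            ON FindDuplicates1.CheckSum = Message.CheckSum;
""")

dup_ids = sorted(row[0] for row in
                 con.execute("SELECT MessageRecID FROM FindDuplicates2"))
print(dup_ids)  # -> [1, 2, 3, 5]
```

Whether an index on Message.CheckSum would speed up the GROUP BY and the join here is something I can only guess at for Valentina; in SQLite it typically would.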

Then the search sql would just use the FindDuplicates2 table.

This is faster, but not fast enough: in my testing it can still take several seconds. It also has the drawback that I need to redo the calculation when archiving is cancelled.

Does anyone have another idea how to speed this up?

Using Valentina 4.8 with RB 2010r5.  

Kind regards

Trixi Willius

http://www.mothsoftware.com
Mail Archiver X: archive, clean and search email
http://www.beatrixwillius.de
Fractals, 3d landscapes etc.


