Speeding up finding duplicates

Ruslan Zasukhin ruslan_zasukhin at valentina-db.com
Wed Apr 27 08:45:50 CDT 2011


On 4/23/11 8:22 AM, "Beatrix Willius" <bwillius at gmx.de> wrote:

Hi Beatrix,

CheckSum is stored in a field of table Mail ?

It should be stored. And indexes.
May be can be used just indexed method.

Then you can do without  any views:

SELECT CheckSum, COUNT(CheckSum) as 'c'
FROM Mail
GROUP BY CheckSum
HAVING c >= 2  
    

Only duplicates will be SELECTed


> I'm trying to speed up finding duplicate entries in my database.
> 
> The database itself has a simple structure: Mailboxes - Mails - Attachments. I
> want to find duplicates with mails. Each mail has a checksum, which identifies
> it.
> 
> The solution that was too slow was a view
> 
> create or replace view CheckDuplicates as (select CheckSum, count(CheckSum) as
> CountOfDup from Message group by CheckSum)
> 
> Then I added a condition to the normal search
> 
>   Message.RecID in (select RecID from Message inner join CheckDuplicates on
> CheckDuplicates.CheckSum = Message.CheckSum where CheckDuplicates.CountOfDup >
> 1) and Mailbox.MailboxPath <> 'Trash')
> 
> This takes way too long for larger databases.
> 
> The idea I have now is to calculate duplicates after adding mails to the
> database.
> 
> create table if not exists FindDuplicates1 (CheckSum Long, CountOfDups
> Integer)
> delete from FindDuplicates1
> create table if not exists FindDuplicates2 (MessageRecID Long)
> delete from FindDuplicates2
> insert into FindDuplicates1 select CheckSum, count(CheckSum) as CountOfDup
> from Message group by CheckSum having countofdup > 1
> insert into FindDuplicates2 select Message.RecID from Message inner join
> FindDuplicates1 on FindDuplicates1.CheckSum = Message.CheckSum
> 
> Then the search sql would just use the FindDuplicates2 table.
> 
> This is faster, but not fast enough. In my testing this may take several
> seconds. And it has the drawback that I need to do this when archiving is
> cancelled, too.
> 
> Does anyone have another idea how to speed this up?

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]




More information about the Valentina mailing list