Speeding up finding duplicates
Ruslan Zasukhin
ruslan_zasukhin at valentina-db.com
Wed Apr 27 08:45:50 CDT 2011
On 4/23/11 8:22 AM, "Beatrix Willius" <bwillius at gmx.de> wrote:
Hi Beatrix,
CheckSum is stored in a field of table Mail ?
It should be stored. And indexes.
May be can be used just indexed method.
Then you can do without any views:
SELECT CheckSum, COUNT(CheckSum) as 'c'
FROM Mail
GROUP BY CheckSum
HAVING c >= 2
Only duplicates will be SELECTed
> I'm trying to speed up finding duplicate entries in my database.
>
> The database itself has a simple structure: Mailboxes - Mails - Attachments. I
> want to find duplicates with mails. Each mail has a checksum, which identifies
> it.
>
> The solution that was too slow was a view
>
> create or replace view CheckDuplicates as (select CheckSum, count(CheckSum) as
> CountOfDup from Message group by CheckSum)
>
> Then I added a condition to the normal search
>
> Message.RecID in (select RecID from Message inner join CheckDuplicates on
> CheckDuplicates.CheckSum = Message.CheckSum where CheckDuplicates.CountOfDup >
> 1) and Mailbox.MailboxPath <> 'Trash')
>
> This takes way too long for larger databases.
>
> The idea I have now is to calculate duplicates after adding mails to the
> database.
>
> create table if not exists FindDuplicates1 (CheckSum Long, CountOfDups
> Integer)
> delete from FindDuplicates1
> create table if not exists FindDuplicates2 (MessageRecID Long)
> delete from FindDuplicates2
> insert into FindDuplicates1 select CheckSum, count(CheckSum) as CountOfDup
> from Message group by CheckSum having countofdup > 1
> insert into FindDuplicates2 select Message.RecID from Message inner join
> FindDuplicates1 on FindDuplicates1.CheckSum = Message.CheckSum
>
> Then the search sql would just use the FindDuplicates2 table.
>
> This is faster, but not fast enough. In my testing this may take several
> seconds. And it has the drawback that I need to do this when archiving is
> cancelled, too.
>
> Does anyone have another idea how to speed this up?
--
Best regards,
Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc
Valentina - Joining Worlds of Information
http://www.paradigmasoft.com
[I feel the need: the need for speed]
More information about the Valentina
mailing list