
Finding Duplicate Files

Posted: Wed Nov 29, 2006 11:35 am
by apkellogg
Is there a program to find duplicate files across multiple hard drives? Basically, I have a shared folder that too many people have had access to over the years, and I would like to see if there are duplicate files saved in multiple places in the folder under different file names. I am using Windows XP Pro/MCE 2005.

Thank you for any advice.

Re: Finding Duplicate Files

Posted: Wed Nov 29, 2006 11:47 am
by Dposcorp
apkellogg wrote:
Is there a program to find duplicate files across multiple hard drives? Basically, I have a shared folder that too many people have had access to over the years, and I would like to see if there are duplicate files saved in multiple places in the folder under different file names. I am using Windows XP Pro/MCE 2005.

Thank you for any advice.


If they have different names, then they are different files.

You would probably need a program to search by size, but I am just guessing at this point.
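A size-based search along those lines could be sketched as follows. This is only an illustration, assuming GNU find (for -printf) and awk are available, e.g. via Cygwin; the function name same_size_files is made up for this example. Matching sizes only flag candidates, since two files of equal size can still have different contents.

```shell
#!/bin/bash
# List every file that shares its exact byte size with at least one other file.
# Assumes GNU find (-printf) and file names without embedded newlines.
# same_size_files is a hypothetical helper name, not an existing tool.
same_size_files() {
    find "$@" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            # Count files per size and remember the full "size<TAB>path" lines.
            { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
            # Emit only the sizes that occur more than once.
            END { for (s in count) if (count[s] > 1) printf "%s", lines[s] }
        ' |
        sort -n
}
```

Calling same_size_files d:/ e:/ would print one "size, tab, path" line per candidate, sorted by size.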

Posted: Wed Nov 29, 2006 11:52 am
by red0510

Posted: Wed Nov 29, 2006 1:43 pm
by just brew it!
Stupid shell tricks FTW! If you have a set of Unix-style shell tools (like Cygwin) available, the following script will do it:
#!/bin/bash
find "$@" -type f -exec sha1sum {} + | sort | uniq --all-repeated=separate --check-chars=40 | sed -e 's/^[^ ]* .//'

(If your browser has wrapped the above code, note that the only line break is after the "#!/bin/bash"; the rest is all one long line.)

Just invoke the script, passing the names of one or more drives or directories on the command line. The script searches all of the listed drives/folders, and lists each group of duplicate files it finds.

It works by recursively walking all of the specified drives/folders, generating a 160-bit checksum for each file, then finding all groups of files which have matching checksums.

So, e.g. if you've saved it as a script named dupfiles, the command:
dupfiles d:/ e:/
would find all duplicate files on your D: and E: drives.
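For anyone who wants to follow the pipeline stage by stage, here is the same idea spread over several lines with comments. It is a sketch rather than a drop-in replacement: it assumes GNU coreutils (sha1sum and the long uniq options) and file names without newlines or backslashes, it has find call sha1sum directly through -exec, and the function name dup_groups is made up for this example.

```shell
#!/bin/bash
# Print groups of files with identical SHA-1 checksums, one blank line between groups.
# Assumes GNU coreutils and file names without newlines or backslashes.
# dup_groups is a hypothetical name chosen for this sketch.
dup_groups() {
    # 1. Walk the given drives/folders and hash every regular file.
    find "$@" -type f -exec sha1sum {} + |
        # 2. Sort so that identical checksums land on adjacent lines.
        sort |
        # 3. Keep only lines whose first 40 characters (the hex SHA-1) repeat.
        uniq --all-repeated=separate --check-chars=40 |
        # 4. Strip the checksum and the two-character separator, leaving the path.
        sed -e 's/^[0-9a-f]\{40\} .//'
}
```

With dup_groups "$@" added as the last line of the file, it would be invoked the same way as above, passing the drives or folders to scan.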

I love little scripting puzzles like this... and it is also an excellent illustration of why I install Cygwin on all of my Windows boxes, and why IMO everyone should learn how to use UNIX-style shell commands. You can accomplish a whole lot with very little code.

Posted: Wed Nov 29, 2006 3:38 pm
by Flying Fox
That doublekiller thing looks like it can even compare by size and date. :o

Posted: Wed Nov 29, 2006 3:40 pm
by just brew it!
Flying Fox wrote:
That doublekiller thing looks like it can even compare by size and date. :o

Checksums are more reliable than looking at size and date though... :D

Posted: Wed Nov 29, 2006 4:12 pm
by TheDVDMan
http://www.tucows.com/preview/373411

http://noclone.net/

There are lots of these types of apps floating around. However, I have yet to find one that does true byte-for-byte compares and doesn't produce false positives, and is easy to use.

So far the best I have come across - aside from a very few false positives - is the now very old ACDSee 3.2 with the Duplicate file finder plugin. It works on more than just images, too.

Of course, finding an app that old is troublesome...

Posted: Wed Nov 29, 2006 4:29 pm
by just brew it!
I'm surprised that the tools give false positives; if the lengths of the files match, the tool should then do a byte-for-byte comparison of the contents to verify the match.
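The check described here, sizes first and then contents, might look roughly like this in shell. It is a minimal sketch and not how any of the tools mentioned in this thread are known to work; same_content is a made-up function name, and cmp and wc are the standard utilities.

```shell
#!/bin/bash
# Succeed (exit status 0) only if two files are byte-for-byte identical.
# same_content is a hypothetical helper name used for illustration.
same_content() {
    local size_a size_b
    # Cheap first test: files of different lengths cannot be duplicates.
    size_a=$(wc -c < "$1") || return 2
    size_b=$(wc -c < "$2") || return 2
    [ "$size_a" -eq "$size_b" ] || return 1
    # Sizes match, so compare the actual contents; -s suppresses output.
    cmp -s -- "$1" "$2"
}
```

For example, same_content a.doc b.doc && echo duplicate would print "duplicate" only when both files hold exactly the same bytes.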

While false positives are theoretically possible with a checksum-based approach like the script I posted above, the odds are so low (SHA-1 is a 160-bit hash, so the chance of an accidental collision is vanishingly small) that practically speaking you'll never see one.
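To put a rough number on "vanishingly small", here is a back-of-the-envelope estimate using the birthday approximation. It assumes the hash behaves like a uniformly random 160-bit function and that nobody crafted the files to collide on purpose; the figure of one million files is just an illustrative choice.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{160}} \approx \frac{n^{2}}{2^{161}},
\qquad
n = 10^{6}: \quad \frac{10^{12}}{2.9 \times 10^{48}} \approx 3.4 \times 10^{-37}
```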

Posted: Wed Nov 29, 2006 5:16 pm
by TheDVDMan
just brew it! wrote:
I'm surprised that the tools give false positives; if the lengths of the files match, the tool should then do a byte-for-byte comparison of the contents to verify the match.

While false positives are theoretically possible with a checksum-based approach like the script I posted above, the odds are so low (SHA-1 is a 160-bit hash, so the chance of an accidental collision is vanishingly small) that practically speaking you'll never see one.


Yup. I'm not sure what method ACDSee uses. It may only be a 32-bit checksum.
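For a sense of scale, here is a rough estimate of what a 32-bit checksum used on its own would do. This is a hypothetical case, not a claim about ACDSee: it assumes the checksum is uniformly distributed, that no size or content check follows it, and a collection of one million distinct files (an illustrative figure). The expected number of falsely matching pairs is then about:

```latex
E[\text{colliding pairs}] \approx \frac{n(n-1)}{2 \cdot 2^{32}} \approx \frac{n^{2}}{2^{33}},
\qquad
n = 10^{6}: \quad \frac{10^{12}}{8.6 \times 10^{9}} \approx 116
```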

The other app, the one that gave more false positives than ACDSee did, I can't remember the name of now. I think it's just called "DupFinder" or something.

I love ACDSee's dup finder; it has great options for auto-deleting the dups it finds...very few other apps seem to have that ability.
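Something similar can be approximated on the output of a script that lists duplicates in blank-line-separated groups, like the one earlier in this thread. The sketch below is a dry run only: it keeps the first path of each group and prints the rest as deletion candidates, and deletes nothing. The function name extra_copies is made up, and treating the first entry as the copy to keep is an arbitrary assumption.

```shell
#!/bin/bash
# Read blank-line-separated groups of duplicate paths on stdin and print every
# path except the first of each group (the candidates for deletion).
# Nothing is deleted here; review the list before removing anything.
# extra_copies is a hypothetical helper name.
extra_copies() {
    awk '
        BEGIN { keep = 1 }           # the first line of the input starts a group
        /^$/  { keep = 1; next }     # a blank line means a new group follows
        keep  { keep = 0; next }     # skip (keep) the first path of each group
              { print }              # every other path is an extra copy
    '
}
```

Usage would be along the lines of dupfiles d:/ e:/ | extra_copies > to_review.txt, followed by checking that list by hand.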