Personal computing discussed

Moderators: renee, Dposcorp

 
apkellogg
Gerbil Elite
Topic Author
Posts: 962
Joined: Wed Feb 25, 2004 10:15 am

Finding Duplicate Files

Wed Nov 29, 2006 11:35 am

Is there a program to find duplicate files across multiple hard drives? Basically, I have a shared folder that too many people have had access to over the years, and I would like to see if there are duplicate files saved at multiple places in the folder under different file names. I am using Windows XP Pro/MCE 2005.

Thank you for any advice.
 
Dposcorp
Minister of Gerbil Affairs
Posts: 2771
Joined: Thu Dec 27, 2001 7:00 pm
Location: Detroit, Michigan

Re: Finding Duplicate Files

Wed Nov 29, 2006 11:47 am

apkellogg wrote:
Is there a program to find duplicate files across multiple hard drives? Basically, I have a shared folder that too many people have had access to over the years, and I would like to see if there are duplicate files saved at multiple places in the folder under different file names. I am using Windows XP Pro/MCE 2005.

Thank you for any advice.


If they have different names, then they are different files.

You would probably need a program to search by size, but I am just guessing at this point.
 
red0510
Gerbil Elite
Posts: 612
Joined: Fri Mar 29, 2002 7:00 pm

Wed Nov 29, 2006 11:52 am

 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Wed Nov 29, 2006 1:43 pm

Stupid shell tricks FTW! If you have a set of Unix-style shell tools (like Cygwin) available, the following script will do it:
#!/bin/bash
# Hash every file, sort so that equal hashes are adjacent, keep only the repeated hashes, then strip the hash column.
find "$@" -type f -exec sha1sum {} + | sort | uniq --all-repeated=separate --check-chars=40 | sed -e 's/^[^ ]* .//'

(If your browser has wrapped the above code, note that the only line breaks are after the "#!/bin/bash" and after the comment line; the pipeline itself is all one long line.)

Just invoke the script, passing the names of one or more drives or directories on the command line. The script searches all of the listed drives/folders, and lists each group of duplicate files it finds.

It works by recursively walking all of the specified drives/folders, generating a 160-bit checksum for each file, then finding all groups of files which have matching checksums.

So, e.g. if you've saved it as a script named dupfiles, the command:
dupfiles d:/ e:/
would find all duplicate files on your D: and E: drives.
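
To see what the output looks like without touching real data, here is a small self-contained sketch. The temporary directory and the three file names are made up for the demo, and it assumes the GNU versions of find, sha1sum, sort, and uniq (which is what Cygwin provides); file names are handed to sha1sum through find's -exec ... + form.

```shell
#!/bin/bash
# Build a throwaway directory: two files with identical contents, one different.
# (Directory and file names are hypothetical, purely for illustration.)
demo=$(mktemp -d)
printf 'same contents\n' > "$demo/report.txt"
printf 'same contents\n' > "$demo/copy of report.txt"
printf 'different\n' > "$demo/notes.txt"

# Checksum every file, sort so equal hashes are adjacent,
# keep only lines whose first 40 characters (the SHA-1 hash) repeat,
# then strip the hash column so only the paths remain.
dups=$(find "$demo" -type f -exec sha1sum {} + \
    | sort \
    | uniq --all-repeated=separate --check-chars=40 \
    | sed -e 's/^[^ ]* .//')

printf '%s\n' "$dups"

# Clean up the demo directory.
rm -rf "$demo"
```

This should list report.txt and "copy of report.txt" together as one group and leave notes.txt out, since only the first two share the same contents.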

I love little scripting puzzles like this... and it is also an excellent illustration of why I install Cygwin on all of my Windows boxes, and why IMO everyone should learn how to use UNIX-style shell commands. You can accomplish a whole lot with very little code.
Nostalgia isn't what it used to be.
 
Flying Fox
Gerbil God
Posts: 25690
Joined: Mon May 24, 2004 2:19 am

Wed Nov 29, 2006 3:38 pm

That doublekiller thing looks like it can even compare by size and date. :o
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Wed Nov 29, 2006 3:40 pm

Flying Fox wrote:
That doublekiller thing looks like it can even compare by size and date. :o

Checksums are more reliable than looking at size and date though... :D
Nostalgia isn't what it used to be.
 
TheDVDMan
Graphmaster Gerbil
Posts: 1276
Joined: Fri Apr 30, 2004 2:34 pm
Location: Not TR anymore!

Wed Nov 29, 2006 4:12 pm

http://www.tucows.com/preview/373411

http://noclone.net/

There are lots of these types of apps floating around. However, I have yet to find one that does true byte-for-byte compares, doesn't produce false positives, and is easy to use.

So far the best I have come across - aside from a very few false positives - is the now very old ACDSee 3.2 with the Duplicate file finder plugin. It works on more than just images, too.

Of course, finding an app that old is troublesome...
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Wed Nov 29, 2006 4:29 pm

I'm surprised that the tools give false positives; if the lengths of the files match, the tool should then do a byte-for-byte comparison of the contents to verify the match.
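
As a rough sketch of that length-then-contents check in shell: same_file is a hypothetical helper name (not part of any tool mentioned in this thread), and it relies only on the standard wc and cmp utilities.

```shell
#!/bin/bash
# same_file A B: succeed (exit status 0) only if A and B are byte-for-byte identical.
# Hypothetical helper for illustration; not taken from any existing tool.
same_file() {
    # Compare lengths first: files of different size cannot be duplicates.
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # cmp -s is silent and returns 0 only when every byte matches.
    cmp -s -- "$1" "$2"
}
```

You would call it as same_file first.doc second.doc && echo "duplicate" (the file names here are placeholders). Since the final answer comes from cmp, a match reported this way cannot be a false positive.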

While false positives are theoretically possible with a checksum-based approach like the one I gave the script for above, the odds are mathematically so low (it's a 160-bit hash, so the odds of getting a collision are vanishingly small) that practically speaking you'll never see one.
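
To put a rough number on "vanishingly small": assuming the hash behaves like a uniformly random 160-bit function and nobody is deliberately crafting colliding files, the usual birthday-bound estimate for n files is shown below. The figure of one million files is just an illustrative assumption.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{160}} \approx \frac{n^{2}}{2^{161}},
\qquad
n = 10^{6}: \quad \frac{10^{12}}{2.9 \times 10^{48}} \approx 3.4 \times 10^{-37}
```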
Nostalgia isn't what it used to be.
 
TheDVDMan
Graphmaster Gerbil
Posts: 1276
Joined: Fri Apr 30, 2004 2:34 pm
Location: Not TR anymore!

Wed Nov 29, 2006 5:16 pm

just brew it! wrote:
I'm surprised that the tools give false positives; if the lengths of the files match, the tool should then do a byte-for-byte comparison of the contents to verify the match.

While false positives are theoretically possible with a checksum-based approach like the one I gave the script for above, the odds are mathematically so low (it's a 160-bit hash, so the odds of getting a collision are vanishingly small) that practically speaking you'll never see one.


Yup. I'm not sure what method ACDSee uses. It may only be a 32-bit checksum.
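
If a tool really did rely on a 32-bit checksum alone (only a guess here), false positives would be easy to explain. A back-of-the-envelope estimate, assuming the checksum values are spread uniformly and taking 10,000 and 100,000 files as illustrative counts:

```latex
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{32}}\right),
\qquad
n = 10^{4}: \; P \approx 0.012,
\qquad
n = 10^{5}: \; P \approx 0.69
```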

The other app, which gave more false positives than ACDSee did, I can't remember the name of now. I think it's just called "DupFinder" or something.

I love ACDSee's dup finder; it has great options for auto-deleting the dups it finds...very few other apps seem to have that ability.
