2012-05-01 01:01
in Linux
I've been rsyncing my Linux machines to my disk server for the last 10 years, and while I've tried to save space by using the trick below, clearly it hadn't been applied carefully everywhere, and it didn't consolidate files across backups from multiple servers.
For a single server, the trick to keep historical snapshots of your backup without using a lot of extra space is to rsync to a directory named current and then cp -al current oldbackup_20120501. This lets you keep rsyncing to current, while oldbackup remains a tree of hardlinks that costs no extra space until the corresponding files in current change to something different.
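In case it helps, here's a minimal sketch of that rotation; the host name myserver and the /backups paths are placeholders I made up, and you'll want your own excludes:

# mirror the server into the always-up-to-date copy
# (-x / --one-file-system keeps /proc, /sys and other mounts out of the backup)
rsync -a -x --delete myserver:/ /backups/myserver/current/
# snapshot it: cp -al recreates the tree as hardlinks, so unchanged files take
# no extra space; when rsync later replaces a changed file in current, it
# writes a new inode and the snapshot keeps the old data
cp -al /backups/myserver/current /backups/myserver/oldbackup_$(date +%Y%m%d)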
While this served me well, it turns out it wasn't perfect: there were some admin errors in the past, plus duplicates across the different servers being backed up. So, I looked for dupe finders so that I could re-hardlink identical files after the fact.
The first thing I quickly found was that comparing all files of the same size was going to be way, way too slow, so I had to limit the deduping to files that also had the same name, or the pool of files to compare would just be way too big.
apt-get install fdupes: has lots of options for recursive scanning, and can delete, hardlink, or even symlink duplicates. I could not find a way to tell it to only compare files with the same name.
http://code.google.com/p/hardlinkpy/ : it's in Python, but it actually runs faster than fdupes for me, and has useful options for working on huge trees: hardlink.py -c -f -x options.txt -x Makefile dir1 dir2 dir3. Its one flaw right now is that it runs out of RAM on my 4GB system when run against 27 million files. To save time when deduping system backups, it's useful to tell hardlink.py to only compare files with the same name.
http://www.pixelbeat.org/fslint/ : I didn't try this one, but it looks nice if you need a GUI.
http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py : a simple Python script you can hack on if you just need to find dupes and act on them.
apt-get install hardlink (yes, another one): hardlink -v -f -p -t -x options.txt -x Makefile dir1 dir2. Mmmh, that one took so much memory on my 4GB server that within 20 minutes it was swapping hard.
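Since "only compare files with the same name" is what keeps the candidate pool small, here is a rough illustration of the idea as a small bash script. This is just a sketch of the approach, not how any of the tools above are implemented; BACKUP_ROOT is a placeholder, and unlike hardlink.py it ignores owner, permission and mtime differences:

#!/bin/bash
# Sketch: hardlink files that have the same basename, size and md5sum.
# Only files that collide on (basename, size) ever get hashed.
BACKUP_ROOT=${1:-/backups}    # placeholder path

declare -A count    # "basename|size" -> how many files share that pair
declare -A keeper   # "basename|size|md5" -> path of the copy we keep

# Pass 1: count how many files share each (basename, size) pair.
while IFS= read -r -d '' f; do
    key="$(basename "$f")|$(stat -c %s "$f")"
    count["$key"]=$(( ${count["$key"]:-0} + 1 ))
done < <(find "$BACKUP_ROOT" -type f -print0)

# Pass 2: hash only the colliding files, then replace dupes with hardlinks.
while IFS= read -r -d '' f; do
    key="$(basename "$f")|$(stat -c %s "$f")"
    [ "${count["$key"]}" -lt 2 ] && continue
    sum=$(md5sum "$f" | cut -d' ' -f1)
    full="$key|$sum"
    if [ -n "${keeper["$full"]}" ]; then
        ln -f "${keeper["$full"]}" "$f"   # same name/size/content: hardlink it
    else
        keeper["$full"]="$f"
    fi
done < <(find "$BACKUP_ROOT" -type f -print0)

Of course, holding those arrays in memory for tens of millions of files would hit the same RAM wall described above, which is why the dedicated tools are still worth using.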
hardlink.py is my favourite for now. Over several days of runs (after all, there are many files to scan and compare), it has already hardlinked 5,646,995 files and saved about 300GB, not bad :)