For a single server, the trick to keeping a history of snapshots of your server backups without using up a lot of space is to rsync to a directory called current and then run
cp -al current oldbackup_20120501
This allows rsyncing to current again while keeping oldbackup made out of hardlinks, until current changes to something different.
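Spelled out, a single pass of that scheme might look something like this (the hostname, backup path, and date below are made-up examples; the exact rsync flags and excludes are up to you):
# Sync the live machine into the "current" tree. rsync normally writes changed
# files to a new inode and renames them into place, so hardlinks held by older
# snapshots keep pointing at the old data (avoid --inplace for that reason).
rsync -aH --delete myserver:/ /backup/myserver/current/
# Freeze the state as a hardlink farm: cp -al recreates the directory tree but
# hardlinks every file to current, so the snapshot costs almost no extra space.
cp -al /backup/myserver/current /backup/myserver/oldbackup_20120501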
While this served me well, it turns out it wasn't perfect: there were some admin errors in the past, and there are duplicates across the different servers being backed up. So I looked for dupe finders that would let me re-hardlink identical files after the fact.
The first thing I quickly found was that comparing all files of the same size was going to be way too slow; I had to limit the deduping to files that also had identical names, or the pool of files to compare would just be way too big.
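For a rough idea of how much the identical-name restriction shrinks the candidate pool, a quick back-of-the-envelope check (my own sketch, GNU find assumed, dir1/dir2/dir3 are placeholders) is to count how many (basename, size) pairs occur more than once:
# Each line that uniq -d keeps is a (name, size) combination seen at least
# twice, i.e. one group of files that would still need a content comparison.
find dir1 dir2 dir3 -type f -printf '%f\t%s\n' | sort | uniq -d | wc -l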
Here's how I invoked hardlink.py:
hardlink.py -c -f -x options.txt -x Makefile dir1 dir2 dir3
Its one flaw right now is that it runs out of RAM on my 4GB system when run on 27 million files. To save time when deduping system backups, it's useful to tell hardlinks.py to only compare files with the same name. For comparison, I also tried hardlink:
hardlink -v -f -p -t -x options.txt -x Makefile dir1 dir2
Mmmh, that one took so much memory on my 4GB server that within 20 minutes it was swapping hard.
hardlinks.py is my favourite for now: over several days of runs (after all, there are many files to scan/compare), I've already saved 5,646,995 files and about 300GB. Not bad :)
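If you want to sanity-check numbers like that after a run (again just a sketch, with a made-up /backup path), counting files that ended up with a link count above 1 gives a rough idea of how much got re-hardlinked:
# A regular file with more than one link now shares its data with at least one
# other path, so this roughly counts files that have been deduplicated.
find /backup -type f -links +1 | wc -l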