I have a huge number of files (mostly documents, ~80-90% PDFs, but also images, videos, web pages, audio, etc.): around 3.8 million files occupying ~7.8 TB on a 10 TB HDD.
I have tested a lot of duplicate-file removers from the internet (both Windows and Linux), but in vain.
Most of them take days to complete, some can't even run because they run out of memory, some crash just as they are about to finish, and others never seem to complete at all.
So I decided to write my own C++ program, which compiles and runs fine on both Linux and Windows, but there is a problem: it also takes a very long time and never seems to finish. It works very well on a small number of files.
I am writing this topic because maybe there is something I could improve/optimize/remove from my algorithm to make it much faster, while still being safe against collisions, i.e. never deleting two different files because they were mistakenly treated as duplicates.
Here is the algorithm:
First, it lists all files recursively under the given path and groups them by size into a dictionary where the key is the file size and the value is the set of file paths with that size.
Second, it removes the entries whose set contains only a single path, since a file with a unique size cannot have a duplicate.
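Simplified, these first two steps look roughly like this (a minimal sketch, not my exact code; names are illustrative and error handling is stripped):

```cpp
#include <cstdint>
#include <filesystem>
#include <iterator>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Walk the tree once and bucket every regular file by its size.
std::unordered_map<std::uintmax_t, std::vector<fs::path>>
group_by_size(const fs::path& root)
{
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> groups;
    for (const auto& entry : fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied))
    {
        if (entry.is_regular_file())
            groups[entry.file_size()].push_back(entry.path());
    }
    // A size that only one file has cannot contain duplicates, so drop it.
    for (auto it = groups.begin(); it != groups.end();)
        it = (it->second.size() < 2) ? groups.erase(it) : std::next(it);
    return groups;
}
```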
Next, it computes an MD5 hash (which is fast) of the first 1024 bytes of each remaining file and stores the result in another dictionary, where the key is the pair (size, MD5 hash) and the value is the set of file paths sharing that size and hash.
Then it again removes the keys that have only a single value, as those files are unique.
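The partial-hash step is roughly this (again a simplified sketch using OpenSSL's EVP interface; the caller combines the returned digest with the file size to build the key):

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <openssl/evp.h>

// Hash only the first 1 KiB of the file with MD5 and return it hex-encoded.
std::string md5_first_1k(const std::string& path)
{
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return {};
    std::array<unsigned char, 1024> buf{};
    const std::size_t n = std::fread(buf.data(), 1, buf.size(), f);
    std::fclose(f);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_md5(), nullptr);
    EVP_DigestUpdate(ctx, buf.data(), n);
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    // Hex-encode so the digest can be stored in a string-based map key.
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < len; ++i) {
        out += hex[digest[i] >> 4];
        out += hex[digest[i] & 0x0f];
    }
    return out;
}
```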
Next, it computes the full SHA3-512 hash (using OpenSSL) of each remaining file and stores the results in another dictionary.
Again, it removes the keys that have only a single value, as those are unique.
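The full-hash step streams each file through OpenSSL in fixed-size chunks so memory use stays flat regardless of file size (simplified sketch; the 1 MiB buffer size is just an example):

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include <openssl/evp.h>

// Stream the whole file through SHA3-512 and return the digest hex-encoded.
std::string sha3_512_file(const std::string& path)
{
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return {};

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha3_512(), nullptr);

    std::vector<unsigned char> buf(1 << 20);        // 1 MiB read buffer
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        EVP_DigestUpdate(ctx, buf.data(), n);
    std::fclose(f);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < len; ++i) {
        out += hex[digest[i] >> 4];
        out += hex[digest[i] & 0x0f];
    }
    return out;
}
```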
Finally, it removes the duplicates that remain in the dictionary, keeping one copy of each file.
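The deletion pass is basically this (minimal sketch; the real program records what it deletes before removing anything):

```cpp
#include <cstddef>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// For each group of confirmed duplicates, keep the first path, delete the rest.
void remove_duplicates(const std::vector<fs::path>& group)
{
    for (std::size_t i = 1; i < group.size(); ++i) {
        std::error_code ec;
        fs::remove(group[i], ec);       // keep group[0], drop the others
    }
}
```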
Everything is done with multithreading and I have optimized it as much as I can, but even so it takes a huge amount of time and seems to never complete the task.
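The parallelization is roughly one task per size group; my actual code uses a bounded worker pool, but this minimal std::async sketch shows the idea (sha3_512_file is the full-hash helper sketched above):

```cpp
#include <future>
#include <string>
#include <utility>
#include <vector>

// Full-file hash helper from the sketch above.
std::string sha3_512_file(const std::string& path);

// Hash every file in one size group sequentially, returning (path, digest).
std::vector<std::pair<std::string, std::string>>
hash_group(const std::vector<std::string>& paths)
{
    std::vector<std::pair<std::string, std::string>> out;
    for (const auto& p : paths)
        out.emplace_back(p, sha3_512_file(p));
    return out;
}

// One async task per size group; results are merged back on the main thread.
void hash_all_groups(const std::vector<std::vector<std::string>>& groups)
{
    std::vector<std::future<std::vector<std::pair<std::string, std::string>>>> jobs;
    jobs.reserve(groups.size());
    for (const auto& g : groups)
        jobs.push_back(std::async(std::launch::async, hash_group, g));
    for (auto& j : jobs) {
        auto results = j.get();
        // ... merge the (path, digest) pairs into the (size, hash) dictionary ...
        (void)results;
    }
}
```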
What should I do to optimize it further?