4

I have a huge number of files (mostly documents, ~80-90% PDFs, but also images, videos, web pages, audio, etc.): somewhere around 3.8 million files occupying ~7.8 TB of space on a 10 TB HDD.

I have tested a lot of duplicate-file removers from the internet (both Windows and Linux), but in vain.

Most of them take days to complete, some cannot complete because they run out of memory, some crash just when they are about to finish, and others never seem to complete at all.

So I decided to write my own C++ program, which compiles and runs fine on both Linux and Windows, but there is a problem: it also takes a very long time and never seems to complete the task. It works very well on a small number of files.

I am asking because maybe there is something I could improve, optimize, or remove from my algorithm to make it much faster while still being safe, i.e. avoiding collisions that would delete different files because they were mistaken for duplicates.

Here is the algorithm:

  • First, it lists all files recursively under the given path and stores them in a dictionary keyed by file size, where each value is the set of file paths that have that size (a simplified sketch of this step is shown right after this list).

  • Second, it removes the keys whose set contains only a single path, because those files are unique.

  • Next, it computes an MD5 hash (chosen because it is fast) of the first 1024 bytes of each remaining file and stores the results in another dictionary, where the key is the pair (size, MD5 hash) and the value is the set of file paths sharing that size and hash.

  • Next, it removes the keys that have a single value, as those files are unique.

  • Next, it computes the full SHA3-512 hash (using OpenSSL) of each remaining file's entire contents and stores the results in another dictionary (a sketch of this step appears a bit further below).

  • Next, it again removes the keys that have a single value, as they are unique.

  • Finally, it proceeds to remove the duplicate files that remain in the dictionary.
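For illustration, here is a simplified, single-threaded sketch of the size-grouping step (my real program is multithreaded; this sketch uses C++17 <filesystem>, function and variable names are just illustrative, and most error handling is omitted):

#include <cstdint>
#include <filesystem>
#include <iterator>
#include <string>
#include <system_error>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Steps 1 and 2: group all regular files under `root` by size, then drop
// sizes that occur only once, since those files cannot have duplicates.
std::unordered_map<std::uintmax_t, std::vector<std::string>>
group_by_size(const fs::path& root)
{
    std::unordered_map<std::uintmax_t, std::vector<std::string>> by_size;
    for (auto it = fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied);
         it != fs::recursive_directory_iterator(); ++it)
    {
        std::error_code ec;
        if (!it->is_regular_file(ec) || ec) continue;   // skip directories, links, unreadable entries
        const auto size = it->file_size(ec);
        if (ec) continue;
        by_size[size].push_back(it->path().string());
    }
    for (auto it = by_size.begin(); it != by_size.end(); )
        it = (it->second.size() < 2) ? by_size.erase(it) : std::next(it);
    return by_size;
}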

Everything is optimized as much as possible and everything is done using multithreading, but even so it takes a huge amount of time and never seems to complete the task.
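To be concrete about where the time goes, the full-file hashing step boils down to a read-and-digest loop; here is a simplified sketch using OpenSSL's EVP API (single-threaded, arbitrary buffer size, error checks omitted, names illustrative):

#include <openssl/evp.h>

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Stream a whole file through SHA3-512 and return the raw 64-byte digest.
std::vector<unsigned char> sha3_512_of_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha3_512(), nullptr);

    std::vector<char> buf(1 << 20);                 // 1 MiB read buffer
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0)
        EVP_DigestUpdate(ctx, buf.data(), static_cast<std::size_t>(in.gcount()));

    std::vector<unsigned char> digest(EVP_MD_size(EVP_sha3_512()));
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest.data(), &len);
    EVP_MD_CTX_free(ctx);
    digest.resize(len);
    return digest;
}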

What should I do to optimize it further?

Rohit Gupta
YoYoYo
  • You describe different steps. At every step you filter the collection of files. Have you timed the different steps and how large the intermediate collections are? – Pieter B Aug 17 '23 at 11:32
  • "store them by size into a dictionary" are you sure you mean a dictionary? I don't think a dictionary is the right data-structure here. A list which you fill and sort later would be better I think. – Pieter B Aug 17 '23 at 11:40
  • Guess what? The authors of those other programs aren't stupid, either. Some problems just take time and memory to solve. Some of that can be saved by implementing brilliant ideas, but at the end of the day, you will need to somehow enumerate all the files and keep information about them in (potentially paged, slow) memory. – Hans-Martin Mosner Aug 17 '23 at 12:00
  • How many actual duplicates are there? You will need to read every byte of every file with a duplicate. So if every file has at least one duplicate you will need to read all the data on the drive. With a 100 MB/s read speed that would be about a day. Probably a lot more since reading small files will cause a lot of seeking. – JonasH Aug 17 '23 at 13:00
  • @PieterB: I can think of two useful data structures here: a dictionary of sets of file paths, where the file size or hash is used as dictionary key (and not what the OP wrote, but I guess they meant *"into a dictionary where the **value** is a set of file paths corresponding to that size."*). Or a list sorted by file size / hash value. Both will allow to identify files of duplicate size / hash value quickly. – Doc Brown Aug 17 '23 at 13:04
  • Given typical hard drive speeds, it's likely that the hashing steps alone will take 11–30 hours. That is: the hashing itself is likely very fast, but bottlenecked by your drive speeds. So multithreading is unlikely to help much (beyond potentially helping with IO queue depth when reading small files). You are unlikely to be limited by RAM: 4 million hashes at 512 bits is only about 256 MB, so a separate MD5 pass is probably slowing you down more than it is speeding you up. Also note that this would allow you to remove duplicates immediately: `if (seen(hash)) remove(path) else insert(hash)` – amon Aug 17 '23 at 15:16
  • The only step which is not inherently slow in your above scenario is the first one, which uses the file size. Unfortunately, this is the only piece of metadata I can think of which will help you to distinguish non-duplicate files and which can be obtained without opening the files. So if that step still leaves you with a million candidates or more, you are screwed. – Doc Brown Aug 17 '23 at 15:37
  • @DocBrown There is also hard-link information, if that should be relevant, to identify aliases. – Deduplicator Aug 17 '23 at 20:24
  • Reading 16TB from a single HDD (2 passes over 8TB) could easily take 20hrs if read sequentially. Unfortunately you're reading 3.8M separate files which typically means the single actuator is bouncing all around the disk (even if it is defragmented!) and that is _really_ slow. Skip the MD5 pass to cut that in half. Use the suggestions in the answers to make this an interruptible/resumable process. – davidbak Aug 17 '23 at 23:27
  • Also be very careful with file name _aliases_. Other people have mentioned hard links. But that's the easy case - delete one of the "dupes" and the other will still be there. The hard cases are when the same file has multiple names due to multiple _access paths_ - sometimes the same directory will show up twice, e.g., over a network-mapped drive _and_ on a direct drive. In _that_ case deleting one of the "duplicates" has just deleted _the only copy of the file_. And there are other similar cases. Take care! **Make a backup before deleting anything!** – davidbak Aug 17 '23 at 23:30
  • Also be wary of files that happen to be the same *right now*, but should vary independently – Caleth Aug 18 '23 at 08:34
  • @PieterB I said that it is a dictionary but in fact it is an unordered_map where the key is the file size and the value is a set of full file paths. Sorry for naming it wrong; I like programming in Python more than in C++, but I chose C++ for this because it is faster. – YoYoYo Aug 19 '23 at 09:28
  • @JonasH I don't know how many duplicates there are. It already took almost a day and a half and I just stopped it because it seemed it would never complete the task. The other programs do the same; they never finish. I just ran one very good program for almost two days and got nothing, and it claims to be the fastest. Another one did the same. It's a spinning HDD by the way, not an SSD. – YoYoYo Aug 19 '23 at 09:32
  • @Deduplicator what is hardlinking? What is that information and how can it be obtained? Could you give an example, please? Thank you in advance! – YoYoYo Aug 19 '23 at 09:38
  • Should I remove the partial MD5 step completely in order to speed up the process? Is it a bad choice to do the full SHA3-512 without the intermediate partial-MD5 step? I am asking because I thought a partial MD5 hash would filter out a lot of files before wasting a lot of time on full SHA3 hashing. Should I replace the partial MD5 hash with a partial XXH3 (I don't know if that is possible because I don't have an SSE2 processor)? Does this hash have collisions? Should I replace the SHA3 step and the partial MD5 step with just XXH3? Would it be faster? How safe is it? https://github.com/Cyan4973/xxHash – YoYoYo Aug 19 '23 at 09:50
  • @YoYoYo: hardlinks are different file paths which reference ("link to") the same physical file (so storage is only used once). In file systems supporting hardlinks, when you remove one of those links, the other links will still point to the same file, and the file's storage is reclaimed only when you delete all of them. If you know for sure you don't have hardlinks, it would be a good idea to mention this. – Doc Brown Aug 19 '23 at 11:35
  • @YoYoYo: you may not know how many duplicates you have, but I guess you know how many candidates from the 3.8 million files remain after the first step of your algo (removing candidates by size). Can you tell us this? – Doc Brown Aug 19 '23 at 11:37
  • @DocBrown I would need to modify the program to print this, but there are still a lot of files, because the other steps also take a very long time. – YoYoYo Aug 20 '23 at 12:04
  • Yesterday I got an idea: given that the program needs to access the disk once to build the list of files, again to get their sizes, again to get their partial MD5 hashes, and finally to get their full SHA3-512 hashes, why not do only partial hashing (not MD5 but a stronger hash like SHA3-512, which doesn't have known collisions)? Could this result in false duplicates? There are already some programs claiming to do only partial hashing and to be very fast. What about false duplicates? – YoYoYo Aug 20 '23 at 12:04
  • E.g. there are books that have the first page or the first two pages the same (advertising, most likely), or the last (one or two) pages, or both, but are otherwise different. Will they have the same partial hash, even a stronger one? Is there any way to deal with this while being very accurate and very fast? Thank you in advance! – YoYoYo Aug 20 '23 at 12:04
  • @YoYoYo: Several people have requested more information about the number of files and the times for the individual steps, and as far as I can see, you have not answered even one of those questions. Honestly, as long as you continue guessing wildly without actually producing some basic stats and numbers, you are doing this wrong. The first rule of optimization is **measure**. – Doc Brown Aug 20 '23 at 12:25
  • @DocBrown yes, true. Sorry. The problem is that it takes too much time. I checked it just for the first step, i.e. filtering all the files by their sizes alone: it detected only ~12,000 unique files and it took ~5 hours or so (sorry, I didn't note the exact results), and after that I stopped it because it never ends. – YoYoYo Aug 21 '23 at 12:12
  • @YoYoYo: 5 hours for determining the file sizes for 3.8 million files means roughly 211 files per second (the time for calculating the duplicates is most probably negligible, since it happens in-memory). That looks very slow to me. How is the HDD connected to the computer where your program runs? USB - which version? Sata / eSata - which version? Something else? – Doc Brown Aug 21 '23 at 13:39
  • @YoYoYo If file one is `dir1/foo` and file two is `dir2/bar` and they contain the same thing, should one of them be removed or is it only uniqueness within one directory that is the goal? Do you foresee the need for rerunning the program every now and then? – Ted Lyngmo Aug 31 '23 at 17:00

4 Answers

10

Disk access, opening files and reading their contents, is slow. You probably can't get around that. So however you look at it, this program is probably going to be running for hours if not days.

Given that, and given that you are writing it yourself for one-off use, you should write your program so that it can fail, be fixed, and then continue running from where it left off.

Instead of using dictionaries, write to a database.

Split it into separate programs or steps: list the files, add the MD5s, add the SHA3s, mark the duplicates, delete the files. Each step goes back to the db. That way you can see how many files you are going to delete / have the same hash, etc., and spot-check that your code is working correctly.

sample db table

Files
- fullFileNameAndPath
- fileName
- md5
- sha3
- duplicate
- deleted

Pass 1

foreach(var file in directory)
{
   if(file is a folder) { recurse; continue; }
   db.Files.Add(file.FullName, file.Name);
}

Pass 2

foreach(var file in db.GetNext100Files())
{
  var filedata = Disk.Open(file.fullFileName)
  var md5 = genMd5(filedata)
  db.UpdateFile(file.fullFileName, md5)
}

Pass 3

foreach(var file in db.GetFilesWithMatchingMD5())
{
  var filedata = Disk.Open(file.fullFileName)
  var sha3 = genSha3(filedata)
  db.UpdateFile(file.fullFileName, sha3)
}

Pass 4

foreach(var fileCollection in db.GetFilesWithMatchingMD5AndSha3())
{
  var oneToKeep = fileCollection.First();  // pick one to keep
  foreach(var file in fileCollection.Except(oneToKeep))
  {
    db.UpdateFile(file.fullFileName, duplicateIsTrue)
  }
}

Pass 5

foreach(var file in db.GetNext100Duplicates())
{
   Disk.Delete(file.fullFileName)
   db.UpdateFile(file.fullFileName, deleted)
}
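As a comment suggests, SQLite is one easy way to get such a table; here is a minimal sketch of creating it and doing the pass 1 insert with the sqlite3 C API from C++ (column names mirror the table above, everything else is illustrative):

#include <sqlite3.h>
#include <string>

// Create the Files table if it does not exist yet.
bool init_db(sqlite3*& db, const std::string& db_path)
{
    if (sqlite3_open(db_path.c_str(), &db) != SQLITE_OK) return false;
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS Files ("
        "  fullFileNameAndPath TEXT PRIMARY KEY,"
        "  fileName  TEXT,"
        "  md5       TEXT,"
        "  sha3      TEXT,"
        "  duplicate INTEGER DEFAULT 0,"
        "  deleted   INTEGER DEFAULT 0)";
    return sqlite3_exec(db, ddl, nullptr, nullptr, nullptr) == SQLITE_OK;
}

// Pass 1: one row per file. INSERT OR IGNORE makes the pass restartable:
// rerunning it simply skips files that were already recorded.
bool add_file(sqlite3* db, const std::string& full_path, const std::string& name)
{
    sqlite3_stmt* stmt = nullptr;
    const char* sql =
        "INSERT OR IGNORE INTO Files (fullFileNameAndPath, fileName) VALUES (?, ?)";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return false;
    sqlite3_bind_text(stmt, 1, full_path.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, name.c_str(), -1, SQLITE_TRANSIENT);
    bool ok = sqlite3_step(stmt) == SQLITE_DONE;
    sqlite3_finalize(stmt);
    return ok;
}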
Ewan
  • sqlite would be ideal for this – mikequentel Aug 17 '23 at 13:41
  • What exactly should it store in the database? Wouldn't this be slower than doing everything on the fly, at runtime? What should be done after the input data has all been written to the database? Thank you in advance! – YoYoYo Aug 18 '23 at 12:15
  • it might be slightly slower, but you won't run out of memory in the night and have to start again from scratch – Ewan Aug 18 '23 at 12:51
  • Basically the algorithm you wrote above just adds db management so the job can continue if interrupted, but my program is already doing all these steps, except that it uses an unordered_map with size-hash keys -> sets of full file paths as values, and it uses threads. And it is extremely slow. You are right about days (even weeks, I don't know), but doing it this way with SQLite will make it even slower than it already is. I need it to be extremely fast and safe to use (detecting duplicates accurately). Thank you! – YoYoYo Aug 19 '23 at 09:58
  • the db will not significantly affect the speed as the limit will be the disk access. – Ewan Aug 19 '23 at 10:08
  • if you want to improve the disk access by splitting the files over multiple machines you will need a central registry of the files and hashes, such as a db – Ewan Aug 19 '23 at 10:09
  • your biggest performance improvement will be in not having to rerun the program over already hashed files if it crashes during its days long run. For this you need to store the hashes, ie on a db – Ewan Aug 19 '23 at 10:10
  • the db also helps you stop running out of memory if your hashmap gets too big. – Ewan Aug 19 '23 at 10:12
  • @Ewan yes, this is true, but in fact it doesn't solve anything, because you need to access the disk the same number of times: once for the full file paths when listing the files, again to get the file sizes, again to do the partial MD5 hashing, again to do the SHA3 hashing, and once more to delete the duplicates. Is that right, or did I misunderstand something about the db that could reduce the number of times the HDD is accessed? Thank you in advance! – YoYoYo Aug 19 '23 at 10:14
  • it helps you solve the above highlighted problems. But yes, unless you move some files to other disks, the file access time is probably your bottleneck. – Ewan Aug 19 '23 at 11:49
  • Yesterday I got an idea: given that the program needs to access the disk once to build the list of files, again to get their sizes, again to get their partial MD5 hashes, and finally to get their full SHA3-512 hashes, why not do only partial hashing (not MD5 but a stronger hash like SHA3-512, which doesn't have known collisions)? Could this result in false duplicates? There are already some programs claiming to do only partial hashing and to be very fast. What about false duplicates? – YoYoYo Aug 20 '23 at 12:02
  • E.g. there are books that have the first page or the first two pages the same (advertising, most likely), or the last (one or two) pages, or both, but are otherwise different. Will they have the same partial hash, even a stronger one? Is there any way to deal with this while being very accurate and very fast? Thank you in advance! – YoYoYo Aug 20 '23 at 12:02
  • your overall method seems fine; I assumed you were doing the hash of the full file only when the MD5 and other factors collided, because it's the slowest step? – Ewan Aug 20 '23 at 14:28
  • if you are looking for exact matches you could just check the filesize – Ewan Aug 20 '23 at 14:29
  • @Ewan Exactly. I only go on to the SHA3 of the full file when the partial MD5 hash and the size still collide. – YoYoYo Aug 21 '23 at 12:05
2

I would start with two (or three) maps/tables (a dedicated DB might help. Maybe.):

Small files (full content), optionally hard-link info for candidate bigger files, and bigger files (just sizes).

The threshold should be small enough that the file is likely to be resident, for file systems supporting that, but at least as big as the prefix your first hash reads. Unless of course you want to ignore small files.

Next, handle the duplicate small files, handle the hard-link info, and finally drop the info about known non-duplicate bigger files, getting leaner data-structures and saving memory.

Drill down into one cluster of candidate duplicate files after another; that way you don't need to keep the known-duplicate trait around, again keeping (resp. making) the data structures lean.

If you want to multi-thread, remember that one thread is probably plenty for hashing data from one disk; you should just always keep a read request in flight, using asynchronous IO or a dedicated paired read-thread. More independent requests might add overhead, and seeking on spinning disks is especially costly. Also, multiple disks might sit behind a common link which becomes a bottleneck if they are fast or numerous enough.
A dedicated thread for collecting the results might help, maybe, if you don't use a database which already does that for you. At least it means that data bounces between CPUs less.
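A rough sketch of the "one hashing thread plus one read always in flight" idea, here approximated with std::async double-buffering rather than platform-specific asynchronous IO (chunk size arbitrary, error handling omitted, SHA3-512 via OpenSSL as in the question; names are illustrative):

#include <openssl/evp.h>

#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Hash one file while the next chunk is being read in the background.
std::vector<unsigned char> hash_with_readahead(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    const std::size_t chunk = std::size_t(4) << 20;            // 4 MiB per read

    auto read_chunk = [&in, chunk] {
        std::vector<char> buf(chunk);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        return buf;
    };

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha3_512(), nullptr);

    auto pending = std::async(std::launch::async, read_chunk);  // first read in flight
    for (;;) {
        std::vector<char> current = pending.get();
        if (current.empty()) break;
        pending = std::async(std::launch::async, read_chunk);   // next read overlaps hashing
        EVP_DigestUpdate(ctx, current.data(), current.size());
    }

    std::vector<unsigned char> digest(EVP_MD_size(EVP_sha3_512()));
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest.data(), &len);
    EVP_MD_CTX_free(ctx);
    digest.resize(len);
    return digest;
}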

Deduplicator
  • What exactly should it store in the database? Wouldn't this be slower than doing everything on the fly, at runtime? What should be done after the input data has all been written to the database? Thank you in advance! – YoYoYo Aug 18 '23 at 12:16
  • "Small files (full content), optionally hard-link info for candidate bigger files, and bigger files (just sizes)." What exactly are small files? What is the size limit above which a file should be considered big? Why just sizes for bigger files? Thank you! – YoYoYo Aug 18 '23 at 12:19
  • Whether outsourcing the data handling to a database is a win needs to be measured; probably it is, for this much data and if not too much adaptation is needed. What exactly counts as a small file is a matter of experience. I gave you the two relevant cornerstones, but it has to be tuned right. And the small-file list is an optimisation on the premise that hashing such small files wastes time. – Deduplicator Aug 18 '23 at 12:36
  • I tried running it a few times and while it was running I checked it with Process Explorer: the maximum amount of RAM (all-time high recorded) was ~4 GB, but usually it takes 200-350 MB and most of the time just 45-65 MB. I clear every vector and every unordered_map once it is no longer needed. CPU usage is not high either, just 0.1-0.2%. HDD throughput is just 0.1-0.2 MB/second. But the HDD is burning hot; you can't keep your hand on its surface for long. And everything is very slow. That's the main problem: it is slow and I want it faster, without removing false duplicates (collisions). – YoYoYo Aug 19 '23 at 10:06
2

When you write a big, hopefully perfect, program and you run it and it takes too long, it is time to break the problem into pieces to see what is taking the time.

Start by recursively scanning all the files in the given path, retrieving the file size of each and doing nothing at all with it.

Edit your question to tell us how long this takes. Our suggestions will depend a lot on what the answer is.

Also give us an idea of how big the largest file is. That will have a strong influence on what algorithms might be suggested.

One final point (until you provide the information asked for): when you are scanning a directory and find a subdirectory entry, do not scan it immediately. Simply add it to a list of “subdirectories that will need to be scanned” and carry on looking through the original directory. When you have finished looking through it, and only then, take the “will need to be scanned” list and scan the subdirectories named in it. The reason for doing it like this is that it avoids a lot of random seeking all over the disk.

And one more point, depending on the results of your “pure scan” timings: do not waste time creating maps and filling them with file names the first time round. Your very first pass should simply note which file sizes occur more than once. Then you will scan the directories again, and do something akin to what you have described… but only on files whose size is in the “more than once” list.
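A sketch of that scanning order, with an explicit "to be scanned" list instead of immediate recursion (C++17 <filesystem>; symlinked directories and error handling are not addressed, and the callback is just a placeholder):

#include <cstdint>
#include <deque>
#include <filesystem>
#include <functional>
#include <system_error>

namespace fs = std::filesystem;

// Finish listing one directory completely before descending into any of its
// subdirectories; the subdirectories are queued and scanned afterwards.
void scan_breadth_first(const fs::path& root,
                        const std::function<void(const fs::path&, std::uintmax_t)>& on_file)
{
    std::deque<fs::path> to_scan{root};
    while (!to_scan.empty()) {
        const fs::path dir = to_scan.front();
        to_scan.pop_front();

        std::error_code ec;
        for (fs::directory_iterator it(dir, ec), end; it != end; it.increment(ec)) {
            if (ec) break;                          // give up on this directory on error
            std::error_code entry_ec;
            if (it->is_directory(entry_ec) && !entry_ec) {
                to_scan.push_back(it->path());      // scan later, after this directory is done
            } else if (it->is_regular_file(entry_ec) && !entry_ec) {
                const auto size = it->file_size(entry_ec);
                if (!entry_ec) on_file(it->path(), size);
            }
        }
    }
}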

1

Most operating systems have file system indices that can help you speed up the job. C++ does have its own way of determining file size, but I am not sure how optimized that is. The point is, you can obtain some metadata without actually opening the file, reading nothing (or very little) from the disk. If you do not want to go platform-specific, you can create the index yourself.

Using the index, you can then very easily get the size of a file. Then group your files by size. There is no point checking files of different sizes against each other for duplication, because they cannot be identical.

Then, for each group of files with the same size, you can compute whatever hash you want to determine duplication. The idea here is to exclude the non-duplicate items out of the gate, before embarking on the computationally intensive work.
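For instance, once the files are grouped by size, only the groups with more than one member need any hashing at all; here is a sketch of that filtering, with the digest function left as a parameter (use whatever hash you prefer):

#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Given files grouped by size, hash only the groups that can contain duplicates
// and regroup them by (size, digest). `hash_fn` maps a path to a digest string.
template <class HashFn>
std::map<std::pair<std::uintmax_t, std::string>, std::vector<std::string>>
hash_candidate_groups(
    const std::unordered_map<std::uintmax_t, std::vector<std::string>>& by_size,
    HashFn hash_fn)
{
    std::map<std::pair<std::uintmax_t, std::string>, std::vector<std::string>> by_size_and_hash;
    for (const auto& [size, paths] : by_size) {
        if (paths.size() < 2) continue;             // unique size: cannot be a duplicate
        for (const auto& path : paths)
            by_size_and_hash[{size, hash_fn(path)}].push_back(path);
    }
    return by_size_and_hash;
}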

Ccm
  • Yesterday I got an idea: given that the program needs to access the disk once to build the list of files, again to get their sizes, again to get their partial MD5 hashes, and finally to get their full SHA3-512 hashes, why not do only partial hashing (not MD5 but a stronger hash like SHA3-512, which doesn't have known collisions)? Could this result in false duplicates? There are already some programs claiming to do only partial hashing and to be very fast. What about false duplicates? – YoYoYo Aug 20 '23 at 12:00
  • E.g. there are books that have the first page or the first two pages the same (advertising, most likely), or the last (one or two) pages, or both, but are otherwise different. Will they have the same partial hash, even a stronger one? Is there any way to deal with this while being very accurate and very fast? Thank you in advance! – YoYoYo Aug 20 '23 at 12:00
  • Media files typically have headers which tell you stuff about the encoding, so the first hundred bytes or so are always the same for identical file formats. E.g. most MP3s will have the same starting bytes, assuming the same bit rate, sampling rate and channel count. The idea of using the file system index is that you do not really query the disk. You query a database that was already created in advance by the operating system, specifically to help applications avoid disk access. But the disadvantage is that you need to write platform-specific code. – Ccm Aug 20 '23 at 17:40
  • Some metadata database services, like opensubtitles, ask you to compute a hash of the first and last series of bytes in addition to the file name. They still give false positives. The only way to know for sure whether a file is identical to another is to read the whole file. – Ccm Aug 20 '23 at 17:42