
I have over 10,000 images, of which about 2,000 are duplicates in other formats (JPEG, PNG, GIF). Both numbers grow every day. I need to delete those duplicates, and to do that I first need a way to find them.

My first thought was to compare an image's pixels and find other pictures that have the same-colored pixels at the same coordinates. But this doesn't always work. Say I search for duplicates of an 8-bit PNG file. The search finds all duplicates saved as 8-bit PNG, sometimes the 8-bit GIF, and rarely the JPEG (because of JPEG's lossy compression, I suppose?).

My second thought was to copy all of the images, recolor them with a strict two-color palette (say, black and white), and run the same scan as above. Yet again the JPEG is not 100% identical to the PNG or GIF version (for the same reason as above?).

The third thought was to lower the threshold for how similar an image must be and raise how much the colors may vary, but that risks removing images that aren't duplicates at all...

Any thoughts?

Aistis
    http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm – Falcon Jul 14 '11 at 12:05
  • Formats with lossy compression will lead to images that are not 100% identical to lossless versions. Must you have a command-line utility, or could you run a GUI program that makes suggestions and then shows the images that have, say, >90% similar pixels (calculate an average deviation)? (And of course pixel size should be identical in any format.) – thorsten müller Jul 14 '11 at 12:09
  • http://stackoverflow.com/questions/2219185/duplicate-image-detection-algorithms –  Jul 14 '11 at 12:20
  • How many would have the same file name but a different extension? – JeffO Jul 14 '11 at 13:00
  • Useful answer that doesn't require weeks of coding: http://stackoverflow.com/questions/596262/image-fingerprint-to-compare-similarity-of-many-images/1076647#1076647 – mac Jul 14 '11 at 13:16

4 Answers


Perceptual hashes may be the answer:

http://www.phash.org/

A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.
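To make the idea concrete, here is a rough Python sketch of the simplest perceptual hash, an "average hash" (aHash): a pixel is encoded as 1 if it is brighter than the image's mean, and near-duplicate images end up with hashes a small Hamming distance apart. It assumes the image has already been shrunk to a tiny grayscale grid (real code would use Pillow or the pHash library itself for that step).

```python
# Minimal average-hash (aHash) sketch, a simplified perceptual hash.
# Assumes the image is already scaled down to a small grayscale grid;
# a real pipeline would use a library (e.g. Pillow or pHash) for resizing.

def average_hash(gray_pixels):
    """Return a bit string: '1' where a pixel is above the mean brightness."""
    flat = [p for row in gray_pixels for p in row]
    mean = sum(flat) / len(flat)
    return ''.join('1' if p > mean else '0' for p in flat)

def hamming_distance(h1, h2):
    """Count differing bits; a small distance suggests visually similar images."""
    return sum(a != b for a, b in zip(h1, h2))

# Two tiny 4x4 "images" that differ only by compression-like noise:
img_a = [[10, 200, 10, 200],
         [200, 10, 200, 10],
         [10, 200, 10, 200],
         [200, 10, 200, 10]]
img_b = [[12, 198, 11, 201],
         [199, 13, 202, 9],
         [10, 200, 10, 200],
         [200, 10, 200, 10]]

print(hamming_distance(average_hash(img_a), average_hash(img_b)))  # prints 0
```

Because the hash is computed from coarse brightness structure rather than exact bytes, the same picture saved as JPEG, PNG, or GIF hashes to (nearly) the same value, which is exactly the property the question needs.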

Joe
  1. Check dimensions. If they differ => the images are not the same.
  2. Check formats. If they are the same => perform a precise comparison, pixel by pixel.
  3. If the formats differ, do this:

Do not compare raw RGB (red, green, blue) values. Give brightness half the weight and color/hue the other half (or 2/3 vs. 1/3). Calculate the difference in values; depending on a 'tolerance' value, the images are either the same or they are not.

JPEG heavily compresses the color information but tries not to ruin the luminance values.
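The weighted comparison described above might look like this in Python. The weights, the Rec. 601 luma coefficients, and the tolerance are illustrative assumptions, not values from the answer.

```python
# Sketch of a luminance-weighted pixel comparison: weight brightness
# differences more reliably than chroma, since JPEG preserves luminance
# better than color. Weights and tolerance below are assumed examples.

def luminance(rgb):
    """Approximate perceived brightness (Rec. 601 luma)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def pixel_difference(p1, p2, luma_weight=0.5):
    """Blend normalized luminance difference with average channel difference."""
    luma_diff = abs(luminance(p1) - luminance(p2)) / 255
    chroma_diff = sum(abs(a - b) for a, b in zip(p1, p2)) / (3 * 255)
    return luma_weight * luma_diff + (1 - luma_weight) * chroma_diff

def same_image(pixels1, pixels2, tolerance=0.05):
    """Mean weighted difference under the tolerance => treat as duplicates."""
    diffs = [pixel_difference(a, b) for a, b in zip(pixels1, pixels2)]
    return sum(diffs) / len(diffs) < tolerance
```

With this scoring, a JPEG's slightly shifted colors barely move the total as long as the brightness structure matches, so cross-format duplicates still land under the tolerance.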

Boris Yankov

When I was screening a bunch of images for dupes some years ago I found that reducing everything to 8x8 thumbnails and then computing a similarity score based on the square of the distance (treating the three colors separately) between the thumbnails worked pretty well. Note that you can hold a LOT of 8x8 thumbnails in memory.

Virtually all dupes scored below the non-dupes, about the only problems being some images that were very low contrast and similar overall even though the actual content varied (the background in each case was beach sand).

This was also effective at catching images that were dupes except someone had reduced the resolution or quality on one in order to cut the file size.
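The scoring step described here is just a sum of squared per-channel distances between the two thumbnails; a minimal sketch, assuming the 8x8 thumbnails have already been generated as flat lists of (R, G, B) tuples:

```python
# Sketch of the 8x8-thumbnail comparison: the score is the sum of squared
# per-channel distances between corresponding thumbnail pixels.
# Thumbnails are assumed precomputed (e.g. with an image library).

def similarity_score(thumb_a, thumb_b):
    """Lower score => more likely a duplicate; 0 means identical thumbnails."""
    return sum((a - b) ** 2
               for pa, pb in zip(thumb_a, thumb_b)
               for a, b in zip(pa, pb))

# An 8x8 thumbnail is only 64 pixels (192 bytes of channel data), so
# millions fit in memory; compare every pair and flag scores below a
# threshold chosen by inspecting the score distribution.
```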

Loren Pechtel
    Typically YUV is better than RGB, less sensitive to minor changes in colour balance. – Martin Beckett Jul 14 '11 at 17:03
  • This technique of thumbnails to pre-select potential matches is valid, YUV is a nice touch and I've seen it turned to a pure luminance map for the same reasons. – Patrick Hughes Jul 14 '11 at 20:01
  • @Martin Beckett: Sum of squares of RGB difference was the first thing I tried and it worked well enough that I didn't try to improve it--and at that it was catching dupes with editing. With a strict definition of dupe it was good enough that I would have let it auto-delete. – Loren Pechtel Jul 14 '11 at 20:48
  • @Loren, if they were minor pixel edits of the same image that should work. It's just that things like jpeg mess up RGB more than a YUV colour space. Just a tip ;-) – Martin Beckett Jul 14 '11 at 21:44
  • By nature, very dark pictures tend to have lower sum-of-squares-of-differences, even if they are not similar at all. The threshold might be adjusted with the average luminosity of the picture. I use this avg luminosity as a pre-filter to avoid O(n^2) image comparisons, so it's already there. – Gabriel Dec 15 '19 at 17:01
  • @Gabriel That wasn't a problem with my dataset, but in the more general case that might help. The O(n^2) time didn't take long enough to make it worth coming up with something better. – Loren Pechtel Dec 16 '19 at 02:25

Maybe you should write some code that scans the images for likeness. You could convert all the pics to ARGB format and compare them in memory.

A possible approach: divide the pictures into zones, then compare the zones' average color and/or brightness to judge how alike two pictures are.

If more than, say, 90% of the zones match, you choose one picture to move to the deletion-candidate list. That gives you a list of candidates. You could use the pictures' aspect ratio to categorize them as horizontal or vertical to speed up comparisons. This way you compensate for lossy algorithms not reproducing exactly the right colors pixel by pixel. Run the program overnight, and in the morning you have it done :) In .NET this can be done quite easily with the GDI+ library.
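The zone idea can be sketched language-agnostically; here is a Python version working on a grayscale pixel grid (the answer suggests .NET/GDI+, but the logic is identical). The zone size and per-zone tolerance below are assumed values.

```python
# Sketch of the zone approach: split a grayscale pixel grid into fixed-size
# blocks, average each block, and report the fraction of matching blocks.
# zone size and tolerance are illustrative assumptions.

def zone_averages(pixels, zone=4):
    """Average brightness per zone x zone block (grid dims divisible by zone)."""
    h, w = len(pixels), len(pixels[0])
    averages = []
    for y in range(0, h, zone):
        for x in range(0, w, zone):
            block = [pixels[yy][xx]
                     for yy in range(y, y + zone)
                     for xx in range(x, x + zone)]
            averages.append(sum(block) / len(block))
    return averages

def likeness(pixels_a, pixels_b, tolerance=10):
    """Fraction of zones whose averages differ by less than tolerance."""
    za, zb = zone_averages(pixels_a), zone_averages(pixels_b)
    matches = sum(abs(a - b) < tolerance for a, b in zip(za, zb))
    return matches / len(za)

# likeness(...) > 0.9 would put the pair on the deletion-candidate list.
```

Averaging over zones is what absorbs the pixel-level noise that lossy formats introduce, which is why this works where an exact pixel comparison fails.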

Onno