I have over 10,000 images, about 2,000 of which are duplicates in other formats (JPEG, PNG, GIF). Both numbers are growing every day. I need to delete the duplicates, but first I need a way to find them.
My first thought was to compare an image's pixels and look for other pictures that have the same color at the same coordinates. But this doesn't always work. Say I search for duplicates of an 8-bit PNG file. It finds the duplicates that are also 8-bit PNGs, sometimes the 8-bit GIFs, and only rarely the JPEGs (because of the image's compression algorithm, I suppose?).
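To make the first approach concrete, here is a simplified sketch of the comparison. It assumes each image has already been decoded into rows of (R, G, B) tuples; `same_pixels` is a made-up helper name, and the JPEG pixel values are invented to illustrate small compression differences, not taken from real files.

```python
# Simplified sketch: an image is a list of rows, each row a list of (R, G, B) tuples.
# same_pixels is a hypothetical helper, not part of any library.

def same_pixels(img_a, img_b):
    """Return True only if both images have the same size and identical pixels."""
    if len(img_a) != len(img_b):
        return False
    for row_a, row_b in zip(img_a, img_b):
        if len(row_a) != len(row_b):
            return False
        # Any single differing pixel at the same coordinates fails the match.
        if any(px_a != px_b for px_a, px_b in zip(row_a, row_b)):
            return False
    return True

png = [[(200, 30, 30), (200, 30, 30)],
       [(10, 10, 10), (250, 250, 250)]]

# Invented values: the same picture after lossy compression, each channel off by 1 or 2.
jpeg = [[(198, 32, 29), (201, 30, 31)],
        [(12, 9, 11), (249, 250, 251)]]

print(same_pixels(png, png))   # True
print(same_pixels(png, jpeg))  # False
```

Even a difference of one unit in one channel makes the exact comparison fail, which matches what I see with the JPEG copies.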
My second thought was to make copies of all the images, recolor them in a strict two-color palette (say, black and white), and run the same comparison as above. But again, the JPEG is not 100% identical to the PNG or GIF (for the same reason as above?).
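A rough sketch of the two-color idea, under the same assumption that images are rows of (R, G, B) tuples. The helper name `to_black_white`, the threshold of 128, and the pixel values are all my own illustrative choices.

```python
# Hypothetical helper: reduce an image to a strict black/white palette.

def to_black_white(img, threshold=128):
    """Map each pixel to 1 (white) or 0 (black) by its average brightness."""
    return [[1 if sum(px) / 3 >= threshold else 0 for px in row] for row in img]

# Invented values: one pixel sits right at the threshold, one is clearly dark.
png = [[(130, 128, 127), (20, 20, 20)]]
jpeg = [[(128, 127, 126), (22, 19, 21)]]  # same picture, slightly shifted colors

print(to_black_white(png))   # [[1, 0]]
print(to_black_white(jpeg))  # [[0, 0]]
print(to_black_white(png) == to_black_white(jpeg))  # False
```

Pixels far from the threshold survive the recoloring, but any pixel close to it can flip from white to black, so the exact comparison still fails.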
My third thought was to lower the percentage of pixels that must match and to increase how much the colors are allowed to vary, but that results in unwanted images being removed...
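The third approach, sketched with two knobs: how far a color channel may differ and what fraction of pixels must match. The names `similarity`, `is_duplicate`, `color_tolerance`, and `min_similarity` are hypothetical, the pixel data is invented, and the sketch assumes both images are non-empty and the same size.

```python
# Hypothetical helpers for a tolerance-based comparison.

def similarity(img_a, img_b, color_tolerance):
    """Fraction of pixels whose channels all differ by at most color_tolerance."""
    pairs = [(pa, pb)
             for row_a, row_b in zip(img_a, img_b)
             for pa, pb in zip(row_a, row_b)]
    close = sum(
        all(abs(ca - cb) <= color_tolerance for ca, cb in zip(pa, pb))
        for pa, pb in pairs
    )
    return close / len(pairs)

def is_duplicate(img_a, img_b, color_tolerance=10, min_similarity=0.9):
    """Treat the images as duplicates if enough pixels are close enough."""
    return similarity(img_a, img_b, color_tolerance) >= min_similarity

png = [[(200, 30, 30), (200, 30, 30)],
       [(10, 10, 10), (250, 250, 250)]]
jpeg = [[(198, 32, 29), (201, 30, 31)],
        [(12, 9, 11), (249, 250, 251)]]
# A different picture that happens to share its top row with png.
other = [[(200, 30, 30), (200, 30, 30)],
         [(120, 120, 120), (90, 90, 90)]]

print(is_duplicate(png, jpeg))                       # True
print(is_duplicate(png, other))                      # False
print(is_duplicate(png, other, min_similarity=0.5))  # True (false positive)
```

With strict settings the JPEG copy is matched and the unrelated picture is not, but once the required similarity is loosened far enough, the unrelated picture gets flagged too, which is exactly the unwanted removal I ran into.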
Any thoughts?