Image Deduplicator
Imagededup is a Python package that simplifies finding exact and near-duplicate images in a collection. It offers several algorithms: perceptual hashing (PHash, DHash, WHash, AHash) and convolutional neural network (CNN) embeddings for more robust near-duplicate detection. The package also includes an evaluation framework and plotting utilities for inspecting duplicates. The current version is 0.3.3.post2, and the project is actively developed.
Warnings
- breaking: Python 3.8 and older are no longer supported; imagededup now requires Python 3.9 or higher.
- breaking: The CNN encoding size has changed from 1024 to 576, and the underlying network may differ, so CNN encodings from previous versions are incompatible.
- breaking: Hashes (all types) generated by versions >= 0.3.0 may differ from those of earlier versions for the same image, making older hash maps incompatible.
- gotcha: When using CNNs for deduplication, memory usage grows quadratically with the number of images during retrieval, because a full cosine-similarity matrix is computed.
- gotcha: The optimal `min_similarity_threshold` (CNN) and `max_distance_threshold` (hashing methods) vary significantly with your dataset and with whether you are targeting exact or near-duplicates.
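To build intuition for `max_distance_threshold`: the hashing methods compare 64-bit hashes by Hamming distance (the number of differing bits), and a pair is reported as duplicates when that distance is at or below the threshold. A minimal standalone sketch of the comparison (the hex strings below are hypothetical hashes, not imagededup output):

```python
def hamming_distance(hash1: str, hash2: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    assert len(hash1) == len(hash2)
    # XOR the integer values; the popcount of the result is the distance.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")

# Two hypothetical 64-bit perceptual hashes (16 hex characters each)
# that differ only in the final bit.
h1 = "9f172786e71f1e00"
h2 = "9f172786e71f1e01"
print(hamming_distance(h1, h2))  # → 1
```

A small distance like this would fall under any reasonable `max_distance_threshold`, so the pair would be flagged as near-duplicates.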
Install
- pip install imagededup
Imports
- PHash
from imagededup.methods import PHash
- DHash
from imagededup.methods import DHash
- WHash
from imagededup.methods import WHash
- AHash
from imagededup.methods import AHash
- CNN
from imagededup.methods import CNN
- plot_duplicates
from imagededup.utils import plot_duplicates
Quickstart
import os
from imagededup.methods import PHash
from PIL import Image  # Pillow is installed as an imagededup dependency
# Create a small image directory for demonstration.
# In a real scenario, 'image_dir' would point to your directory of images.
image_dir = './my_images'
os.makedirs(image_dir, exist_ok=True)
# Note: writing plain text to a .jpg file does not produce a decodable image,
# so generate two tiny valid images instead (identical, hence duplicates).
Image.new('RGB', (64, 64), color='red').save(os.path.join(image_dir, 'image1.jpg'))
Image.new('RGB', (64, 64), color='red').save(os.path.join(image_dir, 'image2.png'))
# Initialize a perceptual hashing method
phasher = PHash()
# Generate encodings for all images in the directory
encodings = phasher.encode_images(image_dir=image_dir)
# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
print(f"Found duplicates: {duplicates}")
# Optionally, remove the dummy directory and files
# import shutil
# shutil.rmtree(image_dir)
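To see what a hash method computes under the hood, here is a self-contained sketch of average hashing, the idea behind `AHash`, run on a hard-coded 4x4 grayscale grid. imagededup's real implementation decodes the image, resizes it, and hashes an 8x8 block, so treat this purely as a conceptual illustration:

```python
def average_hash(pixels):
    """Average hash: bit i is 1 iff pixel i is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    # Render as hex: 4 bits per hex digit.
    return format(bits, f"0{len(flat) // 4}x")

# Hard-coded 4x4 grayscale grid (values 0-255) standing in for a tiny image.
grid = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
print(average_hash(grid))  # → cc33
```

Images with similar brightness structure produce hashes with small Hamming distance, which is exactly what `find_duplicates` thresholds on.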