{"id":4050,"library":"imagededup","title":"Image Deduplicator","description":"Imagededup is a Python package that simplifies finding exact and near-duplicate images in a collection. It offers several algorithms, including perceptual hashing (PHash, DHash, WHash, AHash) and convolutional neural networks (CNNs), for robust deduplication. The package also includes an evaluation framework and plotting utilities for inspecting duplicates. The current version is 0.3.3.post2, and the project is actively maintained.","status":"active","version":"0.3.3.post2","language":"en","source_language":"en","source_url":"https://github.com/idealo/imagededup","tags":["image processing","deduplication","computer vision","hashing","machine learning"],"install":[{"cmd":"pip install imagededup","lang":"bash","label":"Install from PyPI"}],"dependencies":[],"imports":[{"symbol":"PHash","correct":"from imagededup.methods import PHash"},{"symbol":"DHash","correct":"from imagededup.methods import DHash"},{"symbol":"WHash","correct":"from imagededup.methods import WHash"},{"symbol":"AHash","correct":"from imagededup.methods import AHash"},{"symbol":"CNN","correct":"from imagededup.methods import CNN"},{"symbol":"plot_duplicates","correct":"from imagededup.utils import plot_duplicates"}],"quickstart":{"code":"import os\nfrom PIL import Image\nfrom imagededup.methods import PHash\n\n# Create a small demo directory with valid images for demonstration.\n# In a real scenario, 'image_dir' would point to your directory of images.\nimage_dir = './my_images'\nos.makedirs(image_dir, exist_ok=True)\nImage.new('RGB', (64, 64), color='red').save(os.path.join(image_dir, 'image1.jpg'))\nImage.new('RGB', (64, 64), color='red').save(os.path.join(image_dir, 'image2.png'))  # same content, different format\nImage.linear_gradient('L').resize((64, 64)).save(os.path.join(image_dir, 'image3.jpg'))  # a different image\n\n# Initialize a perceptual hashing method\nphasher = PHash()\n\n# Generate encodings for all images in the directory\nencodings = phasher.encode_images(image_dir=image_dir)\n\n# Find duplicates using the generated encodings\nduplicates = phasher.find_duplicates(encoding_map=encodings)\n\nprint(f\"Found duplicates: {duplicates}\")\n\n# Optionally, remove the demo directory and files\n# import shutil\n# shutil.rmtree(image_dir)","lang":"python","description":"This quickstart demonstrates how to use the perceptual hashing (PHash) method to find duplicate images in a directory: initialize the hashing method, generate image encodings, and find duplicates from those encodings. The example uses Pillow (already required by imagededup) to generate small valid images so it can run immediately; in real use, point image_dir at your own image folder."},"warnings":[{"fix":"Upgrade your Python environment to 3.9 or newer.","message":"Python 3.8 and older versions are no longer supported. Imagededup now requires Python 3.9 or higher.","severity":"breaking","affected_versions":"<=0.3.2"},{"fix":"Re-generate CNN encodings for your image collection when upgrading from older versions. Similarity thresholds may also need adjustment for optimal performance with the new encodings.","message":"The CNN encoding size has changed from 1024 to 576, and the underlying network may differ, so CNN encodings from previous versions are incompatible.","severity":"breaking","affected_versions":">=0.3.0"},{"fix":"Re-generate all image hashes if upgrading from a version older than 0.3.0.","message":"Hashes of all types generated by versions >=0.3.0 may differ from those produced by earlier versions for the same image, making older hash maps incompatible.","severity":"breaking","affected_versions":">=0.3.0"},{"fix":"For very large datasets, consider approximate nearest neighbor libraries (e.g., FAISS) or alternative strategies if memory becomes a bottleneck. Benchmarking with your specific dataset is recommended.","message":"When using CNNs for deduplication, memory usage can grow quadratically with the number of images during retrieval (the cosine similarity matrix calculation).","severity":"gotcha","affected_versions":"All"},{"fix":"Experiment with different threshold values for your specific dataset. The documentation suggests a `min_similarity_threshold` around 0.9 for CNNs and a `max_distance_threshold` of 0 for exact duplicates with hashing methods.","message":"The optimal `min_similarity_threshold` for CNNs and `max_distance_threshold` for hashing methods can vary significantly depending on your dataset and the type of duplicates (exact vs. near-duplicate) you are trying to find.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}