Dedupe

3.0.3 verified Fri May 01 auth: no python

A Python library for accurate and scalable data deduplication and entity resolution. Version 3.0.3 requires Python >=3.8 and supports fuzzy matching, blocking, and active learning.

pip install dedupe
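
If the install succeeds but the import still fails, pip and the running interpreter may be pointing at different environments. The following is a minimal sketch for checking that, using only the standard library; the helper name `is_importable` is hypothetical and not part of dedupe.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter running this script; pip must install into the same one.
print("interpreter:", sys.executable)
print("dedupe importable:", is_importable("dedupe"))

# Installing through the interpreter itself avoids a pip/python mismatch:
#   python -m pip install dedupe
```

Running pip as `python -m pip` ties the install to the interpreter shown above rather than to whichever `pip` happens to be first on the PATH.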
error ModuleNotFoundError: No module named 'dedupe'
cause Library not installed or installed in a different environment.
fix
Run 'pip install dedupe' and ensure the correct Python environment is activated.
error AttributeError: module 'dedupe' has no attribute 'Dedupe'
cause A local file or directory named 'dedupe' (e.g. dedupe.py) is shadowing the installed package, or a very old version is installed.
fix
Rename the local file, upgrade the package, then use 'import dedupe' and access 'dedupe.Dedupe'.
error TypeError: 'str' object cannot be interpreted as an integer
cause Passing string keys to data_d but dedupe expects integer keys.
fix
Ensure dictionary keys are integers (e.g., use enumerate).
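
One way to apply the integer-key fix without losing the original identifiers is to re-key the records with enumerate and keep a lookup table back to the old IDs. This is a sketch under that assumption; `rekey_with_integers` and the sample records are hypothetical and not part of dedupe.

```python
def rekey_with_integers(records):
    """Return (data_d, id_lookup).

    data_d maps integer keys to the original record dicts;
    id_lookup maps each integer key back to the original identifier.
    """
    data_d = {}
    id_lookup = {}
    for i, (original_id, record) in enumerate(records.items()):
        data_d[i] = record
        id_lookup[i] = original_id
    return data_d, id_lookup


# Hypothetical records keyed by string identifiers.
records = {
    "cust-001": {"name": "Jane Doe", "address": "12 High St"},
    "cust-002": {"name": "Jane  Doe", "address": "12 High Street"},
}

data_d, id_lookup = rekey_with_integers(records)
print(sorted(data_d))   # integer keys to pass to dedupe
print(id_lookup[1])     # original identifier for record 1
```

After clustering, `id_lookup` lets the integer record IDs in the results be translated back to the source identifiers.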
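breaking dedupe 2.0 reworked the API: method names moved from camelCase to snake_case and the training/sampling methods changed (for example, 'sample' became 'prepare_training', 'markPairs' became 'mark_pairs', and 'match' became 'partition'). dedupe 3.0 additionally replaced dict field definitions with variable objects such as dedupe.variables.String('name').
fix Update imports and method calls to match the current API; refer to the migration guide.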
gotcha 'Dedupe' and the related classes are exposed at the top level of the package, not in separate submodules; the pattern used throughout the official examples is 'import dedupe' followed by dedupe.Dedupe.
fix Use 'import dedupe' and refer to classes as dedupe.Dedupe, dedupe.RecordLink, and so on.
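deprecated The 1.x camelCase labeling names ('consoleLabel', 'markPairs') are not available in dedupe 2.x and later; the current names are 'dedupe.console_label' and 'deduper.mark_pairs'.
fix Use dedupe.console_label(deduper) for interactive labeling, or deduper.mark_pairs with pre-labeled data.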
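
For programmatic labeling, pre-labeled data is passed as a dict with 'match' and 'distinct' lists of record pairs. The sketch below builds that structure from ID pairs using only plain Python; `build_labeled_pairs` and the sample data are hypothetical, and the final call to the deduper (named mark_pairs in dedupe 2.x and later, markPairs in 1.x) is shown only as a comment.

```python
def build_labeled_pairs(data_d, match_ids, distinct_ids):
    """Turn pairs of record IDs into the labeled-pairs dict.

    Each entry is a tuple of the two full record dicts, grouped under
    'match' (same entity) or 'distinct' (different entities).
    """
    return {
        "match": [(data_d[a], data_d[b]) for a, b in match_ids],
        "distinct": [(data_d[a], data_d[b]) for a, b in distinct_ids],
    }


# Hypothetical records and hand-made labels.
data_d = {
    0: {"name": "Jane Doe", "address": "12 High St"},
    1: {"name": "Jane  Doe", "address": "12 High Street"},
    2: {"name": "John Smith", "address": "9 Low Rd"},
}

labeled_pairs = build_labeled_pairs(
    data_d,
    match_ids=[(0, 1)],
    distinct_ids=[(0, 2), (1, 2)],
)
print(len(labeled_pairs["match"]), len(labeled_pairs["distinct"]))

# With a prepared deduper, the labels would then be supplied as:
#   deduper.mark_pairs(labeled_pairs)
```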

Basic dedupe workflow: load data, define fields, sample, train, and cluster duplicates.

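import csv

import dedupe

# Load records into a dict keyed by integer row index.
data_d = {}
with open('input.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        data_d[i] = row

# Define the fields to compare. dedupe 3.x takes variable objects;
# 2.x used dicts such as {'field': 'name', 'type': 'String'}.
fields = [
    dedupe.variables.String('name'),
    dedupe.variables.String('address'),
]
deduper = dedupe.Dedupe(fields)

# Sample candidate pairs for training (named deduper.sample in 1.x).
deduper.prepare_training(data_d, sample_size=10000)

# Label pairs interactively in the terminal. For pre-labeled examples,
# use deduper.mark_pairs(...) instead.
dedupe.console_label(deduper)

# Learn blocking rules and the classifier from the labeled pairs.
deduper.train()

# Cluster: partition returns a list of (record_ids, confidence_scores) tuples.
clustered = deduper.partition(data_d, threshold=0.5)
print(clustered)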
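
The printed result is easier to work with once flattened into a per-record lookup. The sketch below assumes the clustering step returns a list of (record_ids, confidence_scores) tuples, as partition does in dedupe 2.x and later; the sample `clustered` value and the helper `cluster_membership` are hypothetical stand-ins so the example runs without a trained model.

```python
# Hypothetical output in the assumed shape: one (record_ids, scores)
# tuple per cluster, with one score per record in the cluster.
clustered = [
    ((0, 3), (0.97, 0.97)),
    ((1, 2, 4), (0.88, 0.91, 0.85)),
]


def cluster_membership(clusters):
    """Map each record ID to its cluster ID and confidence score."""
    membership = {}
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            membership[record_id] = {
                "cluster_id": cluster_id,
                "confidence": float(score),
            }
    return membership


membership = cluster_membership(clustered)
for record_id in sorted(membership):
    print(record_id, membership[record_id])
```

Records that share a `cluster_id` are the ones judged to refer to the same entity, so this mapping can be joined back onto the rows of the input CSV.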