Dedupe
3.0.3 verified Fri May 01 auth: no python
A Python library for accurate and scalable data deduplication and entity resolution. Version 3.0.3 requires Python >=3.8 and supports fuzzy matching, blocking, and active learning.
pip install dedupe
Common errors
error ModuleNotFoundError: No module named 'dedupe'
cause Library not installed or installed in a different environment.
fix
Run 'pip install dedupe' and ensure the correct Python environment is activated.
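One way to confirm which interpreter is running and whether it can see the package is a small standard-library check. This is only a sketch: the helper name describe_install is illustrative and not part of dedupe.

```python
import importlib.metadata
import importlib.util
import sys


def describe_install(package: str) -> dict:
    """Report whether `package` is importable from the running interpreter."""
    found = importlib.util.find_spec(package) is not None
    try:
        # Version of the installed distribution, if pip knows about one
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        version = None
    return {"interpreter": sys.executable, "found": found, "version": version}


if __name__ == "__main__":
    print(describe_install("dedupe"))
```

If found is False, running 'python -m pip install dedupe' with the interpreter path shown installs into that same environment.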
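error AttributeError: module 'dedupe' has no attribute 'Dedupe'
cause Usually a local file or folder named 'dedupe' (for example dedupe.py) shadowing the installed package, or a very old or broken installation. A failed 'from dedupe import Dedupe' would raise ImportError rather than this error.
fix
Rename the local file or reinstall the package, then use 'import dedupe' and access 'dedupe.Dedupe'.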
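A common source of this kind of AttributeError in Python generally is a local file or folder with the same name as the package, which gets imported in place of the installed library. The sketch below checks a directory listing for such names; find_shadowing_names is a hypothetical helper, not a dedupe function.

```python
import os


def find_shadowing_names(entries, package="dedupe"):
    """Return directory entries that Python would import instead of `package`."""
    shadowing = {f"{package}.py", package}
    return sorted(name for name in entries if name in shadowing)


if __name__ == "__main__":
    # An empty list means nothing in the current directory shadows the package.
    print(find_shadowing_names(os.listdir(".")))
```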
error TypeError: 'str' object cannot be interpreted as an integer
cause Passing string keys to data_d but dedupe expects integer keys.
fix
Ensure dictionary keys are integers (e.g., use enumerate).
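Following the fix above, one way to re-key an existing string-keyed dictionary while keeping a route back to the original identifiers is sketched here. The helper name and the sample records are made up for illustration.

```python
def rekey_with_integers(records):
    """Re-key a mapping with 0-based integers and return (data, id_lookup)."""
    data_d = {}
    id_lookup = {}
    for i, (original_key, row) in enumerate(records.items()):
        data_d[i] = row
        id_lookup[i] = original_key  # remember which record each integer stands for
    return data_d, id_lookup


# Illustrative records keyed by string identifiers
records = {
    "cust-001": {"name": "Ann Lee", "address": "1 Main St"},
    "cust-002": {"name": "Anne Lee", "address": "1 Main Street"},
}
data_d, id_lookup = rekey_with_integers(records)
print(data_d[0]["name"], id_lookup[1])
```

Keeping id_lookup makes it possible to translate clustered integer ids back to the source identifiers afterwards.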
Warnings
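breaking In dedupe v2 the API changed significantly: camelCase methods were renamed to snake_case (markPairs became mark_pairs, consoleLabel became console_label), and the sampling and clustering entry points were reworked (sample gave way to prepare_training, match to partition).
fix Update imports and method calls to match the current API; refer to the migration guide.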
gotcha 'Dedupe' and the related classes are attributes of the top-level package, not submodules; the convention on this page is 'import dedupe' followed by e.g. dedupe.Dedupe.
fix Use 'import dedupe' and reference dedupe.Dedupe.
deprecated The camelCase 1.x names 'consoleLabel' and 'markPairs' are gone in current releases; interactive labeling is 'dedupe.console_label' and programmatic labeling is 'mark_pairs'.
fix Use deduper.mark_pairs or provide pre-labeled data.
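Programmatic labeling takes a dictionary with a "match" list and a "distinct" list, each holding pairs of records. The sketch below only builds that structure; the helper name and sample records are illustrative, and the final call is left as a comment because it needs a prepared deduper. The method is spelled mark_pairs in dedupe 2.x and later (markPairs in 1.x).

```python
def build_labeled_pairs(data_d, match_ids, distinct_ids):
    """Build a {'match': [...], 'distinct': [...]} dict of record pairs."""
    return {
        "match": [(data_d[a], data_d[b]) for a, b in match_ids],
        "distinct": [(data_d[a], data_d[b]) for a, b in distinct_ids],
    }


# Illustrative records; ids 0 and 1 are the same entity, 0 and 2 are not
data_d = {
    0: {"name": "Ann Lee", "address": "1 Main St"},
    1: {"name": "Anne Lee", "address": "1 Main Street"},
    2: {"name": "Bob Ray", "address": "9 Oak Ave"},
}
labeled = build_labeled_pairs(data_d, match_ids=[(0, 1)], distinct_ids=[(0, 2)])

# With a prepared deduper (not constructed here), the labels would be passed as:
# deduper.mark_pairs(labeled)
```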
Imports
- Dedupe
wrong: from dedupe import Dedupe
correct: import dedupe
Quickstart
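import csv
import dedupe
import dedupe.variables

# Load records into a dict keyed by row number
data_d = {}
with open('input.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        data_d[i] = row

# Initialize deduper (dedupe 3.x takes variable objects rather than dicts)
deduper = dedupe.Dedupe([dedupe.variables.String('name'),
                         dedupe.variables.String('address')])

# Training: sample candidate pairs, label them, then fit the model
deduper.prepare_training(data_d, sample_size=10000)
# Label interactively here, or use deduper.mark_pairs / a training file instead
dedupe.console_label(deduper)
deduper.train()

# Cluster: each item is a (record_ids, confidence_scores) tuple
clustered = deduper.partition(data_d, threshold=0.5)
print(clustered)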
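The printed result is easier to use once flattened into a per-record lookup. The sketch below assumes the clustering result is a list of (record_ids, scores) pairs, which is the shape dedupe's partition method returns; the helper name and the stand-in data are illustrative and were not produced by dedupe.

```python
def cluster_membership(clustered):
    """Map each record id to its cluster id and confidence score."""
    membership = {}
    for cluster_id, (record_ids, scores) in enumerate(clustered):
        for record_id, score in zip(record_ids, scores):
            membership[record_id] = {
                "cluster_id": cluster_id,
                "confidence": float(score),
            }
    return membership


# Hand-written stand-in for real clustering output
example = [((0, 1), (0.93, 0.93)), ((2,), (1.0,))]
membership = cluster_membership(example)
print(membership[1]["cluster_id"], membership[2]["cluster_id"])
```

Records that share a cluster_id are the ones judged to refer to the same entity.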