RecordLinkage
RecordLinkage is a powerful and modular Python toolkit for record linkage and duplicate detection. It provides methods for indexing, comparing records with various similarity measures, and classifying matches, leveraging pandas and numpy for efficient data handling. The library is actively maintained (version 0.16) and suitable for research and linking small to medium-sized datasets.
Warnings
- breaking Version 0.15 removed several deprecated classes and bumped the minimum required Python version to 3.8 and pandas version to >=1.0. Older code using removed classes or incompatible Python/pandas versions will break.
- gotcha The choice of blocking keys can introduce significant bias, as true matches that do not share the exact same blocking key will be missed. This is especially problematic with data entry errors or variations.
- gotcha String comparison performance varies greatly between algorithms. Highly accurate but computationally intensive algorithms like Damerau-Levenshtein can be much slower than Jaro-Winkler or Jaro, especially on large datasets. Ensure the 'jellyfish' library (Rust version) is correctly installed for optimal string comparison speed.
- gotcha Effective record linkage heavily depends on prior data cleaning and standardization. Inconsistent formatting, typos, missing values, or variations in how information is recorded (e.g., 'St.' vs 'Street') can severely hinder matching accuracy.
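The blocking and standardization gotchas above can be illustrated without the library. The following is a minimal sketch in plain Python, not recordlinkage's own implementation: `standardize`, `block_pairs`, the `ABBREVIATIONS` map, and the sample addresses are all hypothetical names and data introduced for illustration.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation map; extend it for your own data.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def standardize(value):
    """Lowercase, replace punctuation with spaces, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def block_pairs(left, right, key=lambda v: v):
    """Return (left_id, right_id) pairs whose blocking keys are exactly equal."""
    buckets = defaultdict(list)
    for right_id, value in right.items():
        buckets[key(value)].append(right_id)
    return [
        (left_id, right_id)
        for left_id, value in left.items()
        for right_id in buckets.get(key(value), [])
    ]


# Illustrative records: two formatting variants and one typo ('Raod').
left = {"a1": "12 Main St.", "a2": "5 Oak Ave", "a3": "9 Elm Road"}
right = {"b1": "12 main street", "b2": "5 Oak Avenue", "b3": "9 Elm Raod"}

raw_pairs = block_pairs(left, right)                     # exact match on raw strings
clean_pairs = block_pairs(left, right, key=standardize)  # exact match after cleaning

print(raw_pairs)    # empty: no raw strings are identical
print(clean_pairs)  # a1/b1 and a2/b2 are paired; a3/b3 is still missed
```

Standardization recovers pairs that differ only in formatting, but the typo 'Raod' still falls outside its block, which is the bias the blocking warning describes.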
Install
- pip install recordlinkage
- pip install "recordlinkage[all]"
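Since string comparison speed depends on 'jellyfish' being installed correctly, it can help to confirm what is importable after installation. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper, and it only checks that a module can be found, not which build or version is present.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report the availability of the library and its string-comparison backend.
for name in ("recordlinkage", "jellyfish"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```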
Imports
- recordlinkage
import recordlinkage
- Index
from recordlinkage import Index
- Block
from recordlinkage.index import Block
- Compare
from recordlinkage import Compare
- Exact
from recordlinkage.compare import Exact
Quickstart
import recordlinkage
import pandas as pd
# Dummy data for demonstration
df_a = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Peter Jones'],
    'city': ['New York', 'Los Angeles', 'Chicago']
}, index=['id_a_1', 'id_a_2', 'id_a_3'])
df_b = pd.DataFrame({
    'full_name': ['Jon Doe', 'Jane Smiht', 'Pete Jones'],
    'town': ['New York', 'Los Angles', 'Chicago']
}, index=['id_b_1', 'id_b_2', 'id_b_3'])
# 1. Indexing: Generate candidate links using blocking
indexer = recordlinkage.Index()
indexer.block(left_on='city', right_on='town')
candidate_links = indexer.index(df_a, df_b)
print(f'Generated {len(candidate_links)} candidate links.')
# 2. Comparing: Compare records on relevant attributes
compare_cl = recordlinkage.Compare()
compare_cl.string('name', 'full_name', method='jarowinkler', label='name_similarity')
compare_cl.exact('city', 'town', label='city_exact')
features = compare_cl.compute(candidate_links, df_a, df_b)
# 3. Classification: Decide on matches based on comparison features
# name_similarity is a Jaro-Winkler score in [0, 1]; city_exact is 0 or 1.
# A pair is treated as a match when the summed score exceeds 1.5.
matches = features[features.sum(axis=1) > 1.5]
print('\nIdentified Matches:')
print(matches.index.to_list())
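A summed score lets a strong value in one column compensate for a weak value in another. As an alternative, the sketch below applies a separate rule per feature using plain pandas. The scores, the pair identifiers, and the 0.85 cutoff are hypothetical values chosen for illustration, not output from the quickstart above.

```python
import pandas as pd

# Hypothetical comparison scores, shaped like a feature table with one row
# per candidate pair and one column per comparison.
pairs = pd.MultiIndex.from_tuples([("a1", "b1"), ("a2", "b2"), ("a3", "b3")])
features = pd.DataFrame(
    {"name_similarity": [0.95, 0.97, 0.60], "city_exact": [1, 1, 1]},
    index=pairs,
)

# Sum rule: the weak name score in the last row is masked by the exact city match.
sum_matches = features[features.sum(axis=1) > 1.5]

# Per-feature rule: require a strong name score AND an exact city match.
NAME_CUTOFF = 0.85  # assumed cutoff; tune it on labelled data
is_match = (features["name_similarity"] >= NAME_CUTOFF) & (features["city_exact"] == 1)
rule_matches = features[is_match]

print(len(sum_matches), "pairs accepted by the sum rule")
print(rule_matches.index.to_list())
```

With these illustrative numbers the sum rule accepts all three pairs, while the per-feature rule rejects the pair whose names are only loosely similar.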