RecordLinkage

0.16 · active · verified Thu Apr 09

RecordLinkage is a modular Python toolkit for record linkage and duplicate detection. It provides methods for indexing (generating candidate pairs), comparing records with various similarity measures, and classifying pairs as matches or non-matches, and it builds on pandas and NumPy for data handling. The library is actively maintained (version 0.16) and is suited to research use and to linking small to medium-sized datasets.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates the core steps of record linkage: generating candidate pairs using blocking on a common attribute (city/town), comparing these pairs using string similarity and exact matches, and then classifying likely matches based on the comparison scores.

import recordlinkage
import pandas as pd

# Dummy data for demonstration
df_a = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Peter Jones'],
    'city': ['New York', 'Los Angeles', 'Chicago']
}, index=['id_a_1', 'id_a_2', 'id_a_3'])

df_b = pd.DataFrame({
    'full_name': ['Jon Doe', 'Jane Smiht', 'Pete Jones'],
    'town': ['New York', 'Los Angles', 'Chicago']
}, index=['id_b_1', 'id_b_2', 'id_b_3'])

# 1. Indexing: Generate candidate links using blocking
indexer = recordlinkage.Index()
indexer.block(left_on='city', right_on='town')
candidate_links = indexer.index(df_a, df_b)

print(f'Generated {len(candidate_links)} candidate links.')

# 2. Comparing: Compare records on relevant attributes
compare_cl = recordlinkage.Compare()
compare_cl.string('name', 'full_name', method='jarowinkler', label='name_similarity')
compare_cl.exact('city', 'town', label='city_exact')

features = compare_cl.compute(candidate_links, df_a, df_b)

# 3. Classification: Decide on matches based on comparison features
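# Sum the feature scores per pair: city_exact is 0 or 1, while name_similarity
# is a continuous Jaro-Winkler score between 0 and 1. A sum above 1.5 therefore
# requires matching cities and a name similarity above 0.5.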
matches = features[features.sum(axis=1) > 1.5]

print('\nIdentified Matches:')
print(matches.index.to_list())
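
To make the indexing step more concrete, the sketch below reproduces the effect of exact blocking on city/town with a plain pandas inner join over the same dummy data. It illustrates the idea only and is not how recordlinkage implements blocking internally; the names id_a, id_b, and blocked_pairs are introduced here for the example.

```python
import pandas as pd

# Same dummy data as in the quickstart
df_a = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Peter Jones'],
    'city': ['New York', 'Los Angeles', 'Chicago']
}, index=['id_a_1', 'id_a_2', 'id_a_3'])

df_b = pd.DataFrame({
    'full_name': ['Jon Doe', 'Jane Smiht', 'Pete Jones'],
    'town': ['New York', 'Los Angles', 'Chicago']
}, index=['id_b_1', 'id_b_2', 'id_b_3'])

# Move each index into a named column so it survives the merge
left = df_a.rename_axis('id_a').reset_index()
right = df_b.rename_axis('id_b').reset_index()

# Exact blocking keeps only pairs whose blocking keys are identical,
# which is what an inner join on city == town produces
joined = left.merge(right, left_on='city', right_on='town', how='inner')
blocked_pairs = list(zip(joined['id_a'], joined['id_b']))

print(f'{len(blocked_pairs)} of {len(df_a) * len(df_b)} possible pairs kept')
print(blocked_pairs)
```

Only two of the nine possible pairs survive, because 'Los Angeles' and 'Los Angles' are not identical. The Jane Smith / Jane Smiht pair is therefore never compared, however similar the names are. Blocking on a field that contains typos can silently drop true matches, so a cleaner blocking key or a more tolerant indexing strategy is worth considering for noisy data.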
