Splink
Splink is a Python package for fast, accurate, and scalable probabilistic record linkage (entity resolution). It enables users to deduplicate and link records from datasets that lack unique identifiers, leveraging unsupervised learning based on the Fellegi-Sunter model. Splink supports various SQL backends like DuckDB, Apache Spark, and AWS Athena, allowing it to scale to datasets of 100 million records or more, and provides a suite of interactive visualizations for model understanding and diagnostics.
Warnings
- breaking Splink v5.0 introduces significant breaking changes: the implicit cache mechanism is replaced by explicit cache table management functions, 'salting' is removed, 'chunking' is introduced for large datasets, and internal probabilistic calculations shift from Bayes factors to match weights (log odds) to improve numerical stability. Support for the Athena backend is also being dropped.
- breaking Python 3.8 support was dropped in Splink v4.0.12, in line with the Python community's end-of-life schedule for older versions.
- gotcha Splink performs best when the input data has multiple columns that are not highly correlated. It is not designed for linking single-column 'bag of words' data (e.g., only a company name), and strongly correlated columns (e.g., city and postcode) can also reduce effectiveness.
- deprecated SQLite backend support is minimal and receives less attention from the development team compared to DuckDB and Spark. It has reasonable but not complete coverage of comparison functions, particularly for array and date comparisons.
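The shift from Bayes factors to match weights noted above is a change of scale, not of model: a match weight is the base-2 log of a Bayes factor, so evidence is summed rather than multiplied. A minimal pure-Python sketch of the underlying Fellegi-Sunter arithmetic (the m/u probabilities below are illustrative assumptions, not Splink internals or estimated values):

```python
import math

# Illustrative m/u probabilities for three comparisons (assumed values;
# Splink estimates these from the data during training).
comparisons = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname":    {"m": 0.90, "u": 0.005},
    "dob":        {"m": 0.95, "u": 0.001},
}

prior = 0.001  # probability that two random records match
prior_weight = math.log2(prior / (1 - prior))

# Each comparison contributes log2(m/u) -- its match weight.
# Summing log-odds is numerically stabler than multiplying Bayes factors.
total_weight = prior_weight + sum(
    math.log2(c["m"] / c["u"]) for c in comparisons.values()
)

# Convert the summed log-odds back to a match probability.
match_probability = 2**total_weight / (1 + 2**total_weight)
print(round(match_probability, 6))
```

With agreement on all three columns, the evidence overwhelms the low prior and the match probability ends up very close to 1.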
Install
- pip install splink
- pip install 'splink[spark]'
- pip install 'splink[athena]'
- pip install 'splink[postgres]'
Imports
- Linker
from splink import Linker
- SettingsCreator
from splink import SettingsCreator
- block_on
from splink import block_on
- DuckDBAPI
from splink import DuckDBAPI
- splink_datasets
from splink import splink_datasets
- cl
import splink.comparison_library as cl
Quickstart
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
db_api = DuckDBAPI()
df = splink_datasets.fake_1000
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)
linker = Linker(df, settings, db_api)
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
# Generate pairwise match predictions from the trained model
predictions_df = linker.inference.predict()
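The blocking_rules_to_generate_predictions above control which pairs are ever scored: only record pairs agreeing on at least one blocking key are compared, which is what lets Splink avoid the full quadratic comparison space. A rough pure-Python sketch of the idea, on hypothetical toy records (Splink implements this as SQL on the chosen backend):

```python
from itertools import combinations

# Toy records (hypothetical data mirroring the quickstart's columns).
records = [
    {"id": 1, "first_name": "amy", "surname": "lee", "dob": "1990-01-01"},
    {"id": 2, "first_name": "amy", "surname": "lea", "dob": "1990-01-01"},
    {"id": 3, "first_name": "bob", "surname": "lee", "dob": "1985-05-05"},
    {"id": 4, "first_name": "cat", "surname": "fox", "dob": "1970-02-02"},
]

def block_pairs(records, rules):
    """Collect candidate pairs that agree on every column of at least one rule."""
    pairs = set()
    for rule in rules:
        groups = {}
        for r in records:
            groups.setdefault(tuple(r[c] for c in rule), []).append(r["id"])
        for ids in groups.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs

# Mirrors block_on("first_name", "dob") and block_on("surname").
candidates = block_pairs(records, [("first_name", "dob"), ("surname",)])
print(sorted(candidates))  # → [(1, 2), (1, 3)] -- far fewer than all 6 possible pairs
```

A pair missed by every blocking rule can never be predicted as a match, which is why Splink recommends several overlapping rules rather than one strict one.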