{"id":5499,"library":"splink","title":"Splink","description":"Splink is a Python package for fast, accurate, and scalable probabilistic record linkage (entity resolution). It enables users to deduplicate and link records from datasets that lack unique identifiers, using unsupervised learning based on the Fellegi-Sunter model. Splink supports several SQL backends, including DuckDB, Apache Spark, and AWS Athena, allowing it to scale to datasets of 100 million records or more, and provides a suite of interactive visualizations for model understanding and diagnostics.","status":"active","version":"4.0.16","language":"en","source_language":"en","source_url":"https://github.com/moj-analytical-services/splink","tags":["data linkage","entity resolution","deduplication","probabilistic matching","big data","sql","data science"],"install":[{"cmd":"pip install splink","lang":"bash","label":"Base Install (includes DuckDB and SQLite)"},{"cmd":"pip install 'splink[spark]'","lang":"bash","label":"For Apache Spark backend"},{"cmd":"pip install 'splink[athena]'","lang":"bash","label":"For AWS Athena backend"},{"cmd":"pip install 'splink[postgres]'","lang":"bash","label":"For PostgreSQL backend"}],"dependencies":[{"reason":"Default high-performance SQL backend, bundled with the base install.","package":"duckdb"},{"reason":"Standard-library SQL backend, suited to smaller datasets.","package":"sqlite3"},{"reason":"Optional backend for big data processing (installed with 'splink[spark]').","package":"pyspark"},{"reason":"Optional backend for AWS Athena (installed with 'splink[athena]'). Note: support is being dropped in v5.","package":"pyathena"},{"reason":"Optional backend for PostgreSQL (installed with 'splink[postgres]').","package":"psycopg2-binary"},{"reason":"Used for SQL transpilation to ensure compatibility across multiple SQL engines.","package":"sqlglot"}],"imports":[{"symbol":"Linker","correct":"from splink import Linker"},{"symbol":"SettingsCreator","correct":"from splink import SettingsCreator"},{"symbol":"block_on","correct":"from splink import block_on"},{"symbol":"DuckDBAPI","correct":"from splink import DuckDBAPI"},{"symbol":"splink_datasets","correct":"from splink import splink_datasets"},{"symbol":"cl","correct":"import splink.comparison_library as cl"}],"quickstart":{"code":"import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.JaroAtThresholds(\"surname\"),\n        cl.DateOfBirthComparison(\"dob\", input_is_string=True),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"surname\"),\n    ]\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")], recall=0.7\n)\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\n# Generate pairwise match predictions\npredictions_df = linker.inference.predict()","lang":"python","description":"This quickstart demonstrates how to set up a basic Splink deduplication model using DuckDB. It covers defining comparisons and blocking rules, estimating the model parameters, and generating pairwise predictions. It uses the built-in `fake_1000` dataset for convenience."},"warnings":[{"fix":"Review v5.0 documentation and migration guides for updated API calls, cache management, and probabilistic calculation handling. Users relying on Athena should plan a migration to another backend or stay on Splink v4.x.","message":"Splink v5.0 introduces significant breaking changes. Key updates include the removal of the implicit cache mechanism in favor of explicit cache table management functions, removal of 'salting', introduction of 'chunking' for large datasets, and a shift from Bayes Factors to Match Weights (log-odds) for internal probabilistic calculations to improve numerical stability. Additionally, support for the Athena backend is being dropped.","severity":"breaking","affected_versions":">=5.0.0"},{"fix":"Ensure your environment uses Python 3.9 or higher. The current requirement is `>=3.9.0, <4.0.0`.","message":"Python 3.8 support was dropped in Splink v4.0.12. Older versions of Python are being phased out in alignment with community end-of-life policies.","severity":"breaking","affected_versions":"<4.0.12 (Python 3.8)"},{"fix":"Pre-process data so that multiple, diverse columns are available for linkage. Avoid relying on highly correlated features or a single 'bag of words' column for optimal accuracy.","message":"Splink performs best with input data containing multiple columns that are not highly correlated. It is not designed for linking single-column 'bag of words' data (e.g., only a company name). High correlation between columns (e.g., city and postcode) can also reduce effectiveness.","severity":"gotcha","affected_versions":"All"},{"fix":"For optimal performance and feature coverage, especially with larger datasets or complex comparisons, consider using DuckDB (the default) or another actively supported backend such as Spark or PostgreSQL.","message":"SQLite backend support is minimal and receives less attention from the development team than DuckDB and Spark. It has reasonable but incomplete coverage of comparison functions, particularly for array and date comparisons.","severity":"deprecated","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}