{"id":2258,"library":"recordlinkage","title":"RecordLinkage","description":"RecordLinkage is a powerful and modular Python toolkit for record linkage and duplicate detection. It provides methods for indexing, comparing records with various similarity measures, and classifying matches, leveraging pandas and numpy for efficient data handling. The library is actively maintained (version 0.16) and suitable for research and linking small to medium-sized datasets.","status":"active","version":"0.16","language":"en","source_language":"en","source_url":"https://github.com/J535D165/recordlinkage","tags":["record linkage","deduplication","data matching","fuzzy matching","data cleaning","pandas","data science","entity resolution"],"install":[{"cmd":"pip install recordlinkage","lang":"bash","label":"Standard Install"},{"cmd":"pip install recordlinkage['all']","lang":"bash","label":"Install with Recommended and Optional Dependencies"}],"dependencies":[{"reason":"Core numerical operations.","package":"numpy","optional":false},{"reason":"Primary data handling and manipulation.","package":"pandas","optional":false},{"reason":"Scientific computing functionalities.","package":"scipy","optional":false},{"reason":"Machine learning algorithms for classification.","package":"scikit-learn","optional":false},{"reason":"Optimized string comparison algorithms.","package":"jellyfish","optional":false},{"reason":"Parallel computing and caching.","package":"joblib","optional":false},{"reason":"Accelerating numerical operations (recommended for performance).","package":"numexpr","optional":true},{"reason":"Accelerating NaN evaluations (recommended for performance).","package":"bottleneck","optional":true}],"imports":[{"note":"Main module import.","symbol":"recordlinkage","correct":"import recordlinkage"},{"note":"To create an indexer object.","symbol":"Index","correct":"from recordlinkage import Index"},{"note":"Direct import for specific indexing algorithms.","symbol":"Block","correct":"from 
recordlinkage.index import Block"},{"note":"To create a comparator object.","symbol":"Compare","correct":"from recordlinkage import Compare"},{"note":"Direct import for specific comparison measures.","symbol":"Exact","correct":"from recordlinkage.compare import Exact"}],"quickstart":{"code":"import recordlinkage\nimport pandas as pd\n\n# Dummy data for demonstration\ndf_a = pd.DataFrame({\n    'name': ['John Doe', 'Jane Smith', 'Peter Jones'],\n    'city': ['New York', 'Los Angeles', 'Chicago']\n}, index=['id_a_1', 'id_a_2', 'id_a_3'])\n\ndf_b = pd.DataFrame({\n    'full_name': ['Jon Doe', 'Jane Smiht', 'Pete Jones'],\n    'town': ['New York', 'Los Angles', 'Chicago']\n}, index=['id_b_1', 'id_b_2', 'id_b_3'])\n\n# 1. Indexing: Generate candidate links using blocking\nindexer = recordlinkage.Index()\nindexer.block(left_on='city', right_on='town')\ncandidate_links = indexer.index(df_a, df_b)\n\nprint(f'Generated {len(candidate_links)} candidate links.')\n\n# 2. Comparing: Compare records on relevant attributes\ncompare_cl = recordlinkage.Compare()\ncompare_cl.string('name', 'full_name', method='jarowinkler', label='name_similarity')\ncompare_cl.exact('city', 'town', label='city_exact')\n\nfeatures = compare_cl.compute(candidate_links, df_a, df_b)\n\n# 3. 
Classification: Decide on matches based on comparison features\n# Sum the scores per pair: the exact comparison yields 0 or 1, the Jaro-Winkler similarity a value in [0, 1]\nmatches = features[features.sum(axis=1) > 1.5]\n\nprint('\\nIdentified Matches:')\nprint(matches.index.to_list())","lang":"python","description":"This quickstart demonstrates the core steps of record linkage: generating candidate pairs using blocking on a common attribute (city/town), comparing these pairs using string similarity and exact matches, and then classifying likely matches based on the comparison scores."},"warnings":[{"fix":"Review the release notes for version 0.15 to update deprecated class usages and ensure your environment meets the new minimum Python (3.8+) and pandas (1.0+) requirements.","message":"Version 0.15 removed several deprecated classes and bumped the minimum required Python version to 3.8 and pandas version to >=1.0. Older code using removed classes or incompatible Python/pandas versions will break.","severity":"breaking","affected_versions":">=0.15"},{"fix":"Select blocking keys carefully: combine multiple keys (union of blocks), use more error-tolerant indexing methods (e.g., sorted neighbourhood, phonetic blocking), or pre-process the blocking fields to standardize them.","message":"The choice of blocking keys can introduce significant bias, as true matches that do not share the exact same blocking key will be missed. This is especially problematic with data entry errors or variations.","severity":"gotcha","affected_versions":"All"},{"fix":"Benchmark different string comparison methods for your specific data and choose a balance between accuracy and performance. Prioritize 'jellyfish' for string comparisons and ensure the Rust-backed version is active (check that `import jellyfish.rustyfish` does not raise an exception).","message":"String comparison performance varies greatly between algorithms. 
Highly accurate but computationally intensive algorithms like Damerau-Levenshtein can be much slower than Jaro-Winkler or Jaro, especially on large datasets. Ensure the 'jellyfish' library (Rust version) is correctly installed for optimal string comparison speed.","severity":"gotcha","affected_versions":"All"},{"fix":"Implement robust pre-processing steps using `recordlinkage.preprocessing` or other data cleaning libraries to normalize names, addresses, dates, and other attributes before attempting to link records.","message":"Effective record linkage heavily depends on prior data cleaning and standardization. Inconsistent formatting, typos, missing values, or variations in how information is recorded (e.g., 'St.' vs 'Street') can severely hinder matching accuracy.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}