{"id":4910,"library":"collate-data-diff","title":"Data Diff (collate-data-diff)","description":"collate-data-diff (also known as data-diff) is a Python library and command-line tool designed to efficiently compare and find differences between rows across two databases or tables. It focuses on performance and scalability for large datasets, providing a fast and accurate way to detect data discrepancies. The current version is 0.11.10, and it maintains an active release cadence with frequent updates.","status":"active","version":"0.11.10","language":"en","source_language":"en","source_url":"https://github.com/datafold/data-diff","tags":["database","diff","data comparison","data quality","etl","data validation"],"install":[{"cmd":"pip install collate-data-diff","lang":"bash","label":"Core library"},{"cmd":"pip install \"collate-data-diff[sqlite]\" \"collate-data-diff[postgresql]\" \"collate-data-diff[mysql]\" \"collate-data-diff[snowflake]\"","lang":"bash","label":"With common database drivers (e.g., SQLite, PostgreSQL, MySQL, Snowflake)"}],"dependencies":[{"reason":"Requires Python 3.10 or newer, but less than Python 4.0.","package":"python","optional":false},{"reason":"Optional dependency for PostgreSQL connectivity.","package":"collate-data-diff[postgresql]","optional":true},{"reason":"Optional dependency for MySQL connectivity.","package":"collate-data-diff[mysql]","optional":true},{"reason":"Optional dependency for Snowflake connectivity.","package":"collate-data-diff[snowflake]","optional":true},{"reason":"Optional dependency for Redshift connectivity.","package":"collate-data-diff[redshift]","optional":true},{"reason":"Optional dependency for SQL Server connectivity.","package":"collate-data-diff[mssql]","optional":true},{"reason":"Optional dependency for SQLite connectivity (implicitly via sqlalchemy).","package":"collate-data-diff[sqlite]","optional":true}],"imports":[{"symbol":"diff_tables","correct":"from data_diff import diff_tables"},{"symbol":"TableSegment","correct":"from data_diff import TableSegment"},{"symbol":"connect_to_uri","correct":"from data_diff import connect_to_uri"}],"quickstart":{"code":"import pandas as pd\nimport sqlalchemy as sa\nfrom data_diff import diff_tables, TableSegment, connect_to_uri\n\n# Create two in-memory SQLite databases for demonstration\nuri_1 = \"sqlite:///:memory:\"\nengine_1 = sa.create_engine(uri_1)\nconn_1 = engine_1.connect()\n\nuri_2 = \"sqlite:///:memory:\"\nengine_2 = sa.create_engine(uri_2)\nconn_2 = engine_2.connect()\n\n# Create dummy data in pandas DataFrames\ndf1 = pd.DataFrame({\n    'id': [1, 2, 3, 4],\n    'name': ['Alice', 'Bob', 'Charlie', 'David'],\n    'value': [10, 20, 30, 40]\n})\ndf2 = pd.DataFrame({\n    'id': [1, 2, 3, 5], # id 4 removed, id 5 added\n    'name': ['Alice', 'Bob Changed', 'Charlie', 'Eve'], # Bob's name changed\n    'value': [10, 25, 30, 50] # Bob's value changed\n})\n\n# Populate tables in the in-memory databases\ndf1.to_sql('table_a', conn_1, index=False, if_exists='replace')\ndf2.to_sql('table_b', conn_2, index=False, if_exists='replace')\n\n# Connect to the in-memory databases using data_diff's `connect_to_uri`\ndb1 = connect_to_uri(uri_1)\ndb2 = connect_to_uri(uri_2)\n\n# Define TableSegment objects, specifying the key_columns for efficient diffing\ntable1 = TableSegment(db1, 'table_a', key_columns='id')\ntable2 = TableSegment(db2, 'table_b', key_columns='id')\n\n# Perform the diff and print results\nprint(\"Differences found:\")\ndiff_results = list(diff_tables(table1, table2))\nfor d in diff_results:\n    print(d)\n\n# Expected Output:\n# (- , 4, 'David', 40)  # Row removed from table_a\n# (+ , 5, 'Eve', 50)    # Row added to table_b\n# (- , 2, 'Bob', 20)    # Original row for ID 2\n# (+ , 2, 'Bob Changed', 25) # Modified row for ID 2","lang":"python","description":"This quickstart demonstrates how to use `data-diff` programmatically to compare two tables in in-memory SQLite databases. It requires `collate-data-diff`, `pandas` (for creating dummy dataframes), and `sqlalchemy` (for database connectivity). Install with `pip install collate-data-diff[sqlite] pandas sqlalchemy`."},"warnings":[{"fix":"Adjust code that processes the output of `diff_tables` to expect single changes as lists of tuples, rather than direct tuples.","message":"The behavior of iterating over diff results changed in v0.11.0. Previously, `diff_tables` would yield individual tuples representing row changes. From v0.11.0 onwards, it now yields lists of tuples for individual changes (e.g., `(- , 4, 'David', 40)` is now `('-', 4, 'David', 40)`).","severity":"breaking","affected_versions":">=0.11.0"},{"fix":"Review the official documentation for current limitations and workarounds for SQL Server if it is a critical part of your data stack.","message":"SQL Server support is limited, and users might encounter issues or incomplete functionality compared to other supported databases.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always provide the `key_columns` argument (e.g., `key_columns='id'`) when initializing `TableSegment` for tables with primary or unique keys.","message":"For efficient diffing, it is crucial to specify `key_columns` when creating `TableSegment` objects. Without explicit key columns, `data-diff` may perform a full-table scan or fall back to less efficient diffing strategies.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install the necessary database extras for your specific database(s) using `pip install \"collate-data-diff[<db_name>]\"`.","message":"Database-specific drivers are required via extra installs (e.g., `pip install \"collate-data-diff[postgresql]\"`). Failure to install the correct extras will lead to connection errors for that database type.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}