Data Diff (collate-data-diff)
collate-data-diff (also known as data-diff) is a Python library and command-line tool designed to efficiently compare and find differences between rows across two databases or tables. It focuses on performance and scalability for large datasets, providing a fast and accurate way to detect data discrepancies. The current version is 0.11.10, and it maintains an active release cadence with frequent updates.
Warnings
- breaking The behavior of iterating over diff results changed in v0.11.0. Previously, `diff_tables` would yield individual tuples representing row changes. From v0.11.0 onwards, it now yields lists of tuples for individual changes (e.g., `(- , 4, 'David', 40)` is now `('-', 4, 'David', 40)`).
- gotcha SQL Server support is limited, and users might encounter issues or incomplete functionality compared to other supported databases.
- gotcha For efficient diffing, it is crucial to specify `key_columns` when creating `TableSegment` objects. Without explicit key columns, `data-diff` may perform a full-table scan or fall back to less efficient diffing strategies.
- gotcha Database-specific drivers are required via extra installs (e.g., `pip install "collate-data-diff[postgresql]"`). Failure to install the correct extras will lead to connection errors for that database type.
Install
-
pip install collate-data-diff -
pip install "collate-data-diff[sqlite]" "collate-data-diff[postgresql]" "collate-data-diff[mysql]" "collate-data-diff[snowflake]"
Imports
- diff_tables
from data_diff import diff_tables
- TableSegment
from data_diff import TableSegment
- connect_to_uri
from data_diff import connect_to_uri
Quickstart
import pandas as pd
import sqlalchemy as sa
from data_diff import diff_tables, TableSegment, connect_to_uri
# Create two in-memory SQLite databases for demonstration
uri_1 = "sqlite:///:memory:"
engine_1 = sa.create_engine(uri_1)
conn_1 = engine_1.connect()
uri_2 = "sqlite:///:memory:"
engine_2 = sa.create_engine(uri_2)
conn_2 = engine_2.connect()
# Create dummy data in pandas DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'value': [10, 20, 30, 40]
})
df2 = pd.DataFrame({
'id': [1, 2, 3, 5], # id 4 removed, id 5 added
'name': ['Alice', 'Bob Changed', 'Charlie', 'Eve'], # Bob's name changed
'value': [10, 25, 30, 50] # Bob's value changed
})
# Populate tables in the in-memory databases
df1.to_sql('table_a', conn_1, index=False, if_exists='replace')
df2.to_sql('table_b', conn_2, index=False, if_exists='replace')
# Connect to the in-memory databases using data_diff's `connect_to_uri`
db1 = connect_to_uri(uri_1)
db2 = connect_to_uri(uri_2)
# Define TableSegment objects, specifying the key_columns for efficient diffing
table1 = TableSegment(db1, 'table_a', key_columns='id')
table2 = TableSegment(db2, 'table_b', key_columns='id')
# Perform the diff and print results
print("Differences found:")
diff_results = list(diff_tables(table1, table2))
for d in diff_results:
print(d)
# Expected Output:
# (- , 4, 'David', 40) # Row removed from table_a
# (+ , 5, 'Eve', 50) # Row added to table_b
# (- , 2, 'Bob', 20) # Original row for ID 2
# (+ , 2, 'Bob Changed', 25) # Modified row for ID 2