Data Diff (collate-data-diff)

0.11.10 · active · verified Sun Apr 12

collate-data-diff (also known as data-diff) is a Python library and command-line tool designed to efficiently compare and find differences between rows across two databases or tables. It focuses on performance and scalability for large datasets, providing a fast and accurate way to detect data discrepancies. The current version is 0.11.10, and it maintains an active release cadence with frequent updates.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to use `data-diff` programmatically to compare two tables in in-memory SQLite databases. It requires `collate-data-diff`, `pandas` (for creating dummy dataframes), and `sqlalchemy` (for database connectivity). Install with `pip install collate-data-diff[sqlite] pandas sqlalchemy`.

import pandas as pd
import sqlalchemy as sa
from data_diff import diff_tables, TableSegment, connect_to_uri

# Create two in-memory SQLite databases for demonstration
uri_1 = "sqlite:///:memory:"
engine_1 = sa.create_engine(uri_1)
conn_1 = engine_1.connect()

uri_2 = "sqlite:///:memory:"
engine_2 = sa.create_engine(uri_2)
conn_2 = engine_2.connect()

# Create dummy data in pandas DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'value': [10, 20, 30, 40]
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 5], # id 4 removed, id 5 added
    'name': ['Alice', 'Bob Changed', 'Charlie', 'Eve'], # Bob's name changed
    'value': [10, 25, 30, 50] # Bob's value changed
})

# Populate tables in the in-memory databases
df1.to_sql('table_a', conn_1, index=False, if_exists='replace')
df2.to_sql('table_b', conn_2, index=False, if_exists='replace')

# Connect to the in-memory databases using data_diff's `connect_to_uri`
db1 = connect_to_uri(uri_1)
db2 = connect_to_uri(uri_2)

# Define TableSegment objects, specifying the key_columns for efficient diffing
table1 = TableSegment(db1, 'table_a', key_columns='id')
table2 = TableSegment(db2, 'table_b', key_columns='id')

# Perform the diff and print results
print("Differences found:")
diff_results = list(diff_tables(table1, table2))
for d in diff_results:
    print(d)

# Expected Output:
# (- , 4, 'David', 40)  # Row removed from table_a
# (+ , 5, 'Eve', 50)    # Row added to table_b
# (- , 2, 'Bob', 20)    # Original row for ID 2
# (+ , 2, 'Bob Changed', 25) # Modified row for ID 2

view raw JSON →