DataComPy

0.19.5 · active · verified Sat Apr 11

DataComPy is a Python library for comparing two DataFrames, with backends for Pandas, Spark, Polars, and Snowflake. It generates human-readable reports of discrepancies at both the row and column level, and accepts absolute or relative tolerances for numeric comparisons. The current version is 0.19.5. The project is working toward a v1 release: new features target the development branches, while the 0.19.x branch is for maintenance and critical fixes.
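To make the tolerance idea concrete, here is a minimal pure-Python sketch of how an absolute and a relative tolerance can be combined into one check. The helper `within_tolerance` is hypothetical and not part of DataComPy; it assumes the same rule as `numpy.isclose`, which may differ in detail from what the library does internally.

```python
def within_tolerance(a: float, b: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True if a and b differ by no more than abs_tol + rel_tol * |b|.

    Hypothetical helper for illustration only. The rule mirrors numpy.isclose;
    that DataComPy applies exactly this rule is an assumption.
    """
    return abs(a - b) <= abs_tol + rel_tol * abs(b)


# 67.89 vs 67.90 differ by about 0.01
print(within_tolerance(67.89, 67.90, abs_tol=0.001))  # False: 0.01 exceeds 0.001
print(within_tolerance(67.89, 67.90, abs_tol=0.02))   # True: 0.01 is within 0.02
print(within_tolerance(67.89, 67.90, rel_tol=0.001))  # True: allowed gap is about 0.068
```

An absolute tolerance suits values of a known scale, such as currency amounts, while a relative tolerance scales with the magnitude of the values being compared.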

Warnings

Install

Imports

Quickstart

This quickstart compares two Pandas DataFrames using `datacompy.Compare`. It builds two sample DataFrames, creates a `Compare` object with a join column and tolerances for numeric comparisons, prints a report of the differences, and checks whether the DataFrames match overall.

import pandas as pd
import datacompy
from io import StringIO

data1 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.89,Bob
3,99.99,Charlie
4,10.00,David
"""
df1 = pd.read_csv(StringIO(data1))

data2 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.90,Bobbert
3,99.99,Charlie
5,11.00,Eve
"""
df2 = pd.read_csv(StringIO(data2))

# Compare the two DataFrames, joining rows on 'acct_id'.
# abs_tol and rel_tol set how far numeric values may differ and still count as equal.
compare = datacompy.Compare(
    df1, df2,
    join_columns='acct_id',
    abs_tol=0.001, # Absolute tolerance for numeric fields
    rel_tol=0.0    # Relative tolerance for numeric fields
)

# Print a detailed report
print(compare.report())

# Check if dataframes match entirely
print(f"DataFrames match: {compare.matches()}")
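As a rough cross-check of what the report should surface for this sample data, the same discrepancies can be found with the standard library alone. This is an illustrative sketch, not DataComPy's implementation: the `load` helper and the variable names are hypothetical, and the 0.001 threshold simply mirrors the `abs_tol` used above.

```python
import csv
from io import StringIO

data1 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.89,Bob
3,99.99,Charlie
4,10.00,David
"""
data2 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.90,Bobbert
3,99.99,Charlie
5,11.00,Eve
"""


def load(text):
    """Index CSV rows by the join key, acct_id (hypothetical helper)."""
    return {row["acct_id"]: row for row in csv.DictReader(StringIO(text))}


rows1, rows2 = load(data1), load(data2)

# Row-level differences: join keys present on only one side.
only_in_df1 = sorted(rows1.keys() - rows2.keys())
only_in_df2 = sorted(rows2.keys() - rows1.keys())

# Column-level differences among rows present on both sides.
mismatches = {}
for key in sorted(rows1.keys() & rows2.keys()):
    diffs = []
    if abs(float(rows1[key]["dollar_amt"]) - float(rows2[key]["dollar_amt"])) > 0.001:
        diffs.append("dollar_amt")
    if rows1[key]["name"] != rows2[key]["name"]:
        diffs.append("name")
    if diffs:
        mismatches[key] = diffs

print(only_in_df1)  # ['4']
print(only_in_df2)  # ['5']
print(mismatches)   # {'2': ['dollar_amt', 'name']}
```

Because each DataFrame has a row the other lacks and account 2 differs in two columns, the two sample DataFrames do not match entirely.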
