DataComPy
DataComPy is a Python library for comparing two DataFrames, with support for several backends including Pandas, Spark, Polars, and Snowflake. It generates detailed, human-readable reports of discrepancies at both the row and column level, and lets you specify absolute and/or relative tolerances for numeric comparisons. The library is currently at version 0.19.5 and is progressing towards a v1 release: new features target the v1 development branches, while the 0.19.x branch receives only maintenance and critical fixes.
Warnings
- breaking The `LegacySparkCompare` and `SparkPandasCompare` classes were removed in version 0.17.0. For Spark DataFrame comparison, use `SparkSQLCompare` (from `datacompy.spark.sql`) or the general `datacompy.is_match` function with the appropriate backend setup (often leveraging Fugue).
- gotcha The default `datacompy.Compare` (Pandas native) requires DataFrames to fit into memory. Comparing very large datasets may lead to out-of-memory errors. For larger-than-memory datasets, consider using the Spark SQL or Polars backend implementations, which are more performant for big data.
- gotcha When comparing floating-point numbers, minor precision differences can cause mismatches even if values are conceptually the same. Always use `abs_tol` (absolute tolerance) and/or `rel_tol` (relative tolerance) parameters in `datacompy.Compare` to account for these expected deviations.
- gotcha DataComPy's duplicate row matching logic can be 'naïve' if `join_columns` do not uniquely identify rows. If many duplicates exist, `datacompy` sorts by other fields to create a temporary ID, which might not align with desired matching.
- gotcha If 'DATACOMPY_NULL' exists as a legitimate string value in your `join_columns`, it can conflict with how `datacompy` internally handles null values during duplicate matching, potentially causing merge failures.
- deprecated DataComPy is actively moving towards a v1 release. The `0.19.x` branch will receive only dependency updates and critical bug fixes, with no new features. Future `v1` releases may introduce breaking changes as development targets `v1` branches (`develop` and `main`).
- gotcha Python 3.12 and above currently have limited support with Spark and Ray within the Fugue backend. Pandas and Polars comparisons should work fine.
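The floating-point gotcha above is easy to reproduce with plain Python. The sketch below uses the standard library's `math.isclose` to illustrate the idea behind `abs_tol`/`rel_tol` (it is not datacompy itself, and `math.isclose` combines its tolerances slightly differently than datacompy does):

```python
import math

# Two values that are equal in business terms but differ in float precision
a = 0.1 + 0.2   # 0.30000000000000004
b = 0.3

# Exact equality fails on floating-point noise...
print(a == b)  # False

# ...but a tolerance-based check passes, which is what the
# abs_tol / rel_tol parameters of datacompy.Compare give you
print(math.isclose(a, b, abs_tol=0.001))  # True
```

This is why a comparison with `abs_tol=0` can report spurious mismatches on numeric columns that went through different computation paths.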
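The occurrence-index idea behind the duplicate-matching gotcha can be sketched in plain pandas. This is a simplified illustration of the concept, not datacompy's actual implementation (datacompy additionally sorts on the remaining columns before assigning the temporary ID):

```python
import pandas as pd

df1 = pd.DataFrame({"acct_id": [1, 1, 2], "amt": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"acct_id": [1, 1, 2], "amt": [20.0, 10.0, 30.0]})

# Assign an occurrence index per join key so duplicates get a temporary
# unique ID -- this pairing is order-dependent, which is the "naive" part
for df in (df1, df2):
    df["_occurrence"] = df.groupby("acct_id").cumcount()

merged = df1.merge(df2, on=["acct_id", "_occurrence"], suffixes=("_1", "_2"))
print(merged)
```

Here the first `acct_id=1` row of `df1` (amt 10.0) pairs with the first of `df2` (amt 20.0) and is reported as a mismatch, even though the two frames contain the same multiset of rows, which is why `join_columns` should uniquely identify rows wherever possible.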
Install
- pip install datacompy
- pip install datacompy[spark]
- pip install datacompy[polars]
- pip install datacompy[fugue]
- pip install datacompy[snowflake]
Imports
- Compare
import datacompy
compare = datacompy.Compare(...)
- Compare
from datacompy.core import Compare
compare = Compare(...)
- SparkCompare
from datacompy.spark.sql import SparkSQLCompare
Quickstart
import pandas as pd
import datacompy
from io import StringIO
data1 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.89,Bob
3,99.99,Charlie
4,10.00,David
"""
df1 = pd.read_csv(StringIO(data1))
data2 = """acct_id,dollar_amt,name
1,123.45,Alice
2,67.90,Bobbert
3,99.99,Charlie
5,11.00,Eve
"""
df2 = pd.read_csv(StringIO(data2))
# Perform comparison, joining on 'acct_id'
# abs_tol and rel_tol are crucial for float comparisons
compare = datacompy.Compare(
df1, df2,
join_columns='acct_id',
abs_tol=0.001, # Absolute tolerance for numeric fields
rel_tol=0.0 # Relative tolerance for numeric fields
)
# Print a detailed report
print(compare.report())
# Check if dataframes match entirely
print(f"DataFrames match: {compare.matches()}")