{"id":2918,"library":"datacompy","title":"DataComPy","description":"DataComPy is a powerful Python library designed to simplify the comparison of two DataFrames, supporting various backends including Pandas, Spark, Polars, and Snowflake. It generates detailed, human-readable reports highlighting discrepancies at both row and column levels, and allows for the specification of absolute or relative tolerance levels for numeric comparisons. The library is currently at version 0.19.5 and is actively progressing towards a v1 release, with new features targeting development branches while the 0.19.x branch is for maintenance and critical fixes.","status":"active","version":"0.19.5","language":"en","source_language":"en","source_url":"https://github.com/capitalone/datacompy.git","tags":["data comparison","dataframe","pandas","spark","polars","data quality","etl validation"],"install":[{"cmd":"pip install datacompy","lang":"bash","label":"Basic Installation"},{"cmd":"pip install datacompy[spark]","lang":"bash","label":"For Spark DataFrame comparison"},{"cmd":"pip install datacompy[polars]","lang":"bash","label":"For Polars DataFrame comparison"},{"cmd":"pip install datacompy[fugue]","lang":"bash","label":"For Fugue-supported backends (Dask, DuckDB, Ray, etc.)"},{"cmd":"pip install datacompy[snowflake]","lang":"bash","label":"For Snowflake (Snowpark) DataFrame comparison"}],"dependencies":[{"reason":"Core dependency for Pandas DataFrame comparisons.","package":"pandas","optional":false},{"reason":"Optional, required for Spark DataFrame comparisons.","package":"pyspark","optional":true},{"reason":"Optional, required for Polars DataFrame comparisons.","package":"polars","optional":true},{"reason":"Optional, enables comparison across various backends like Dask, DuckDB, and Ray. Integrates with DataComPy.","package":"fugue","optional":true},{"reason":"Optional, required for Snowflake DataFrame comparisons.","package":"snowflake-snowpark-python","optional":true}],"imports":[{"symbol":"Compare","correct":"import datacompy\ncompare = datacompy.Compare(...)"},{"symbol":"Compare","correct":"from datacompy.core import Compare\ncompare = Compare(...)"},{"note":"LegacySparkCompare and SparkPandasCompare were removed in v0.17.0. The recommended approach for Spark is SparkSQLCompare, or using the generic `datacompy.is_match` with Fugue for various backends.","wrong":"from datacompy import SparkCompare","symbol":"SparkCompare","correct":"from datacompy.spark.sql import SparkSQLCompare"}],"quickstart":{"code":"import pandas as pd\nimport datacompy\nfrom io import StringIO\n\ndata1 = \"\"\"acct_id,dollar_amt,name\n1,123.45,Alice\n2,67.89,Bob\n3,99.99,Charlie\n4,10.00,David\n\"\"\"\ndf1 = pd.read_csv(StringIO(data1))\n\ndata2 = \"\"\"acct_id,dollar_amt,name\n1,123.45,Alice\n2,67.90,Bobbert\n3,99.99,Charlie\n5,11.00,Eve\n\"\"\"\ndf2 = pd.read_csv(StringIO(data2))\n\n# Perform comparison, joining on 'acct_id'\n# abs_tol and rel_tol are crucial for float comparisons\ncompare = datacompy.Compare(\n    df1, df2,\n    join_columns='acct_id',\n    abs_tol=0.001, # Absolute tolerance for numeric fields\n    rel_tol=0.0    # Relative tolerance for numeric fields\n)\n\n# Print a detailed report\nprint(compare.report())\n\n# Check if dataframes match entirely\nprint(f\"DataFrames match: {compare.matches()}\")","lang":"python","description":"This quickstart demonstrates how to compare two Pandas DataFrames using `datacompy.Compare`. It initializes two sample DataFrames, then creates a `Compare` object specifying the join column and tolerance levels for numeric comparisons. Finally, it prints a comprehensive report of differences and checks for an overall match."},"warnings":[{"fix":"Migrate to `datacompy.spark.sql.SparkSQLCompare` or `datacompy.is_match` with Fugue backend. Consult the documentation for specific backend usage.","message":"`LegacySparkCompare` and `SparkPandasCompare` classes were removed in version 0.17.0. For Spark DataFrame comparison, use `SparkSQLCompare` (from `datacompy.spark.sql`) or the general `datacompy.is_match` function with the appropriate backend setup (often leveraging Fugue).","severity":"breaking","affected_versions":">=0.17.0"},{"fix":"For large datasets, use `datacompy` with Spark, Polars, or Fugue backends after installing the relevant extras (e.g., `pip install datacompy[spark]`).","message":"The default `datacompy.Compare` (Pandas native) requires DataFrames to fit into memory. Comparing very large datasets may lead to out-of-memory errors. For larger-than-memory datasets, consider using the Spark SQL or Polars backend implementations, which are more performant for big data.","severity":"gotcha","affected_versions":"All"},{"fix":"Set appropriate `abs_tol` and `rel_tol` values when initializing the `Compare` object (e.g., `abs_tol=0.0001`, `rel_tol=0.01`).","message":"When comparing floating-point numbers, minor precision differences can cause mismatches even if values are conceptually the same. Always use `abs_tol` (absolute tolerance) and/or `rel_tol` (relative tolerance) parameters in `datacompy.Compare` to account for these expected deviations.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure `join_columns` provide sufficient granularity to uniquely identify rows, or pre-process duplicates if specific matching behavior is required.","message":"DataComPy's duplicate row matching logic can be 'naïve' if `join_columns` do not uniquely identify rows. If many duplicates exist, `datacompy` sorts by other fields to create a temporary ID, which might not align with desired matching.","severity":"gotcha","affected_versions":"All"},{"fix":"Either rename or replace 'DATACOMPY_NULL' values in your join columns before comparison, or fill nulls with a different sentinel value of your choice.","message":"If 'DATACOMPY_NULL' exists as a legitimate string value in your `join_columns`, it can conflict with how `datacompy` internally handles null values during duplicate matching, potentially causing merge failures.","severity":"gotcha","affected_versions":"All"},{"fix":"Monitor GitHub releases and documentation for `v1` migration guides when upgrading from `0.19.x`.","message":"DataComPy is actively moving towards a v1 release. The `0.19.x` branch will receive only dependency updates and critical bug fixes, with no new features. Future `v1` releases may introduce breaking changes as development targets `v1` branches (`develop` and `main`).","severity":"deprecated","affected_versions":"0.19.x"},{"fix":"If using Spark or Ray with Fugue, consider using Python versions <3.12 until full compatibility is announced. Otherwise, Pandas and Polars backends are supported.","message":"Python 3.12 and above currently have limited support with Spark and Ray within the Fugue backend. Pandas and Polars comparisons should work fine.","severity":"gotcha","affected_versions":">=3.12.0"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}