Chispa
Chispa is a PySpark test helper library that provides fast and descriptive methods for comparing Spark DataFrames. It's designed to make writing high-quality PySpark unit tests easier by offering clear error messages when assertions fail. The library is currently at version 0.12.0 and is actively maintained with regular releases.
Warnings
- breaking Chispa versions prior to 0.12.0 may not be fully compatible with PySpark 4.x. Version 0.12.0 explicitly adds support for Spark 4.x, so users upgrading their PySpark environment should ensure they are on Chispa 0.12.0 or newer.
- deprecated The internal `bcolors` utility module was removed in `v0.11.0`. While primarily an internal refactor, direct imports of `chispa.bcolors` will now fail.
- gotcha By default, `assert_df_equality` performs a strict comparison, expecting identical schemas (including nullability and metadata), column order, and row order. Divergences in any of these aspects will cause an assertion failure unless corresponding `ignore_*` flags (e.g., `ignore_nullable`, `ignore_column_order`, `ignore_row_order`, `ignore_metadata`) are explicitly set to `True`.
- gotcha Prior to `v0.11.1`, a bug existed in `assert_df_equality` where using both `ignore_columns` and `ignore_row_order` simultaneously could lead to incorrect DataFrame comparisons due to faulty row ordering logic. This was resolved in `v0.11.1`.
Install
pip install chispa
Imports
- assert_df_equality
from chispa.dataframe_comparer import assert_df_equality
- assert_column_equality
from chispa.column_comparer import assert_column_equality
- assert_approx_df_equality
from chispa.dataframe_comparer import assert_approx_df_equality
Quickstart
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality

# Initialize SparkSession (for local testing; typically done in a pytest fixture)
spark = SparkSession.builder.appName("ChispaQuickstart").getOrCreate()

# Create an 'actual' DataFrame
data_actual = [
    ("john", 1),
    ("jane", 2),
    ("doe", 3),
]
df_actual = spark.createDataFrame(data_actual, ["name", "id"])

# Create an 'expected' DataFrame (identical, for a passing test)
data_expected = [
    ("john", 1),
    ("jane", 2),
    ("doe", 3),
]
df_expected = spark.createDataFrame(data_expected, ["name", "id"])

# Assert equality - this comparison passes
try:
    assert_df_equality(df_actual, df_expected)
    print("Assertion Passed: df_actual and df_expected are equal.")
except Exception as e:
    print(f"Assertion Failed: {e}")

# Create a different 'expected' DataFrame to demonstrate a failing test
data_expected_fail = [
    ("john", 1),
    ("jane", 99),  # intentional difference
    ("doe", 3),
]
df_expected_fail = spark.createDataFrame(data_expected_fail, ["name", "id"])

# Assert equality - this comparison fails with a descriptive error
try:
    assert_df_equality(df_actual, df_expected_fail)
    print("Assertion Passed (unexpectedly): df_actual and df_expected_fail are equal.")
except Exception as e:
    print(f"Assertion Failed (as expected): {e}")

# Stop the SparkSession
spark.stop()