Chispa

0.12.0 · active · verified Fri Apr 10

Chispa is a PySpark test helper library that provides fast and descriptive methods for comparing Spark DataFrames. It's designed to make writing high-quality PySpark unit tests easier by offering clear error messages when assertions fail. The library is currently at version 0.12.0 and is actively maintained with regular releases.

Warnings

Install
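
chispa is distributed on PyPI, so a plain pip install is the usual route. A sketch for a local test environment follows; note that chispa does not pull in Spark for you, so pyspark is listed explicitly here as an assumption about your setup:

```shell
# Install chispa from PyPI (pyspark is not installed automatically;
# chispa leaves Spark provisioning to the user)
pip install chispa

# Or install both together for a self-contained local test environment
pip install chispa pyspark
```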

Imports

Quickstart

This quickstart demonstrates how to use `chispa.dataframe_comparer.assert_df_equality` to compare PySpark DataFrames. It initializes a SparkSession, builds an actual DataFrame and two expected DataFrames (one identical, one with a single differing value), and then calls `assert_df_equality` on each pair, showing both a passing assertion and a failing one with chispa's descriptive error message.

```python
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality

# Initialize SparkSession (for local testing, typically done in a pytest fixture)
spark = SparkSession.builder.appName("ChispaQuickstart").getOrCreate()

# Create an 'actual' DataFrame
data_actual = [
    ("john", 1),
    ("jane", 2),
    ("doe", 3),
]
df_actual = spark.createDataFrame(data_actual, ["name", "id"])

# Create an 'expected' DataFrame (identical, for a passing test)
data_expected = [
    ("john", 1),
    ("jane", 2),
    ("doe", 3),
]
df_expected = spark.createDataFrame(data_expected, ["name", "id"])

# Assert equality - this comparison should pass
try:
    assert_df_equality(df_actual, df_expected)
    print("Assertion passed: df_actual and df_expected are equal.")
except Exception as e:
    print(f"Assertion failed: {e}")

# Create a different 'expected' DataFrame to demonstrate a failing test
data_expected_fail = [
    ("john", 1),
    ("jane", 99),  # intentional difference
    ("doe", 3),
]
df_expected_fail = spark.createDataFrame(data_expected_fail, ["name", "id"])

# Assert equality - this comparison fails and prints chispa's descriptive diff
try:
    assert_df_equality(df_actual, df_expected_fail)
    print("Assertion passed (unexpectedly).")
except Exception as e:
    print(f"Assertion failed (as expected): {e}")

# Stop the SparkSession
spark.stop()
```
