{"id":2430,"library":"chispa","title":"Chispa","description":"Chispa is a PySpark test helper library that provides fast and descriptive methods for comparing Spark DataFrames. It's designed to make writing high-quality PySpark unit tests easier by offering clear error messages when assertions fail. The library is currently at version 0.12.0 and is actively maintained with regular releases.","status":"active","version":"0.12.0","language":"en","source_language":"en","source_url":"https://github.com/MrPowers/chispa","tags":["pyspark","testing","dataframe","assertion","unit-testing"],"install":[{"cmd":"pip install chispa","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Chispa is a test helper for PySpark DataFrames; PySpark is required for its functionality but not a direct install_requires dependency.","package":"pyspark","optional":true}],"imports":[{"symbol":"assert_df_equality","correct":"from chispa.dataframe_comparer import assert_df_equality"},{"symbol":"assert_column_equality","correct":"from chispa.column_comparer import assert_column_equality"},{"note":"Used for comparing DataFrames with floating point numbers, allowing for a specified precision.","symbol":"assert_approx_df_equality","correct":"from chispa.dataframe_comparer import assert_approx_df_equality"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom chispa.dataframe_comparer import assert_df_equality\n\n# Initialize SparkSession (for local testing, typically done in a pytest fixture)\nspark = SparkSession.builder.appName(\"ChispaQuickstart\").getOrCreate()\n\n# Create an 'actual' DataFrame\ndata_actual = [\n    (\"john\", 1),\n    (\"jane\", 2),\n    (\"doe\", 3)\n]\ndf_actual = spark.createDataFrame(data_actual, [\"name\", \"id\"])\n\n# Create an 'expected' DataFrame (identical for a passing test)\ndata_expected = [\n    (\"john\", 1),\n    (\"jane\", 2),\n    (\"doe\", 3)\n]\ndf_expected = spark.createDataFrame(data_expected, [\"name\", 
\"id\"])\n\n# Assert equality - this test should pass\ntry:\n    assert_df_equality(df_actual, df_expected)\n    print(\"Assertion Passed: df_actual and df_expected are equal.\")\nexcept Exception as e:\n    print(f\"Assertion Failed: {e}\")\n\n# Create a different 'expected' DataFrame to demonstrate a failing test\ndata_expected_fail = [\n    (\"john\", 1),\n    (\"jane\", 99),  # Intentional difference\n    (\"doe\", 3)\n]\ndf_expected_fail = spark.createDataFrame(data_expected_fail, [\"name\", \"id\"])\n\n# Assert equality - this test should fail, showing a descriptive error\ntry:\n    assert_df_equality(df_actual, df_expected_fail)\n    print(\"Assertion Passed (unexpectedly): df_actual and df_expected_fail are equal.\")\nexcept Exception as e:\n    print(f\"Assertion Failed (as expected): {e}\")\n\n# Stop SparkSession\nspark.stop()","lang":"python","description":"This quickstart demonstrates how to use `chispa.dataframe_comparer.assert_df_equality` to compare two PySpark DataFrames. It initializes a SparkSession, creates an actual DataFrame and two expected DataFrames (one identical, one with a single differing value), and then calls `assert_df_equality` on each pair to show both a passing and a failing comparison with Chispa's descriptive error messages."},"warnings":[{"fix":"Upgrade chispa to 0.12.0 or a later version: `pip install --upgrade chispa`","message":"Chispa versions prior to 0.12.0 may not be fully compatible with PySpark 4.x. Version 0.12.0 explicitly adds support for Spark 4.x, so users upgrading their PySpark environment should ensure they are on Chispa 0.12.0 or newer.","severity":"breaking","affected_versions":"<0.12.0"},{"fix":"Remove any direct imports of `chispa.bcolors`. If custom console coloring is needed, use a dedicated library such as `colorama`.","message":"The internal `bcolors` utility module was removed in `v0.11.0`. 
Although this was primarily an internal refactor, direct imports of `chispa.bcolors` now fail.","severity":"deprecated","affected_versions":">=0.11.0"},{"fix":"Review the specific comparison requirements. Use `ignore_nullable=True`, `ignore_metadata=True`, `ignore_column_order=True`, or `ignore_row_order=True` as appropriate for your test case. For floating-point comparisons, use `assert_approx_df_equality` with its `precision` argument.","message":"By default, `assert_df_equality` performs a strict comparison, expecting identical schemas (including nullability and metadata), column order, and row order. A difference in any of these causes an assertion failure unless the corresponding `ignore_*` flag (e.g., `ignore_nullable`, `ignore_column_order`, `ignore_row_order`, `ignore_metadata`) is explicitly set to `True`.","severity":"gotcha","affected_versions":"All"},{"fix":"Upgrade chispa to version 0.11.1 or newer to ensure correct behavior when combining `ignore_columns` and `ignore_row_order`.","message":"Prior to `v0.11.1`, a bug in `assert_df_equality` meant that using `ignore_columns` and `ignore_row_order` together could produce incorrect DataFrame comparisons due to faulty row ordering logic. This was resolved in `v0.11.1`.","severity":"gotcha","affected_versions":"<0.11.1"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}