{"id":7607,"library":"pyspark-test","title":"PySpark DataFrame Testing Utility","description":"pyspark-test is a Python library designed to simplify unit testing for PySpark DataFrames. It provides a function, `assert_pyspark_df_equal`, inspired by the pandas testing module, which allows users to compare two Spark DataFrames and identify any differences. The library is currently at version 0.2.0 and has a stable, albeit infrequent, release cadence, focusing on its core DataFrame comparison functionality.","status":"active","version":"0.2.0","language":"en","source_language":"en","source_url":"https://github.com/debugger24/pyspark-test","tags":["pyspark","testing","dataframe","unit-test","spark"],"install":[{"cmd":"pip install pyspark-test","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Core dependency for creating and manipulating Spark DataFrames, which this library tests.","package":"pyspark"}],"imports":[{"symbol":"assert_pyspark_df_equal","correct":"from pyspark_test import assert_pyspark_df_equal"}],"quickstart":{"code":"import datetime\nfrom pyspark import SparkContext\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.types import StructType, StructField, DateType, StringType, DoubleType, LongType\nfrom pyspark_test import assert_pyspark_df_equal\n\n# Initialize SparkSession for testing\nsc = SparkContext.getOrCreate()\nspark = SparkSession(sc)\n\n# Create two identical DataFrames\ndf_1 = spark.createDataFrame(\n    data=[\n        [datetime.date(2020, 1, 1), 'apple', 1.123, 10],\n        [None, 'banana', 2.345, 20],\n    ],\n    schema=StructType([\n        StructField('col_a', DateType(), True),\n        StructField('col_b', StringType(), True),\n        StructField('col_c', DoubleType(), True),\n        StructField('col_d', LongType(), True),\n    ]),\n)\ndf_2 = spark.createDataFrame(\n    data=[\n        [datetime.date(2020, 1, 1), 'apple', 1.123, 10],\n        [None, 'banana', 2.345, 20],\n    ],\n    schema=StructType([\n    
    StructField('col_a', DateType(), True),\n        StructField('col_b', StringType(), True),\n        StructField('col_c', DoubleType(), True),\n        StructField('col_d', LongType(), True),\n    ]),\n)\n\n# Assert that the two DataFrames are equal\nprint(\"Asserting identical DataFrames...\")\nassert_pyspark_df_equal(df_1, df_2, check_dtype=True, check_column_names=True, check_columns_in_order=True, order_by=['col_a', 'col_b'])\nprint(\"Assertion successful: DataFrames are equal.\")\n\n# Example of intentionally different DataFrames to demonstrate failure\ndf_3 = spark.createDataFrame(\n    data=[\n        [datetime.date(2020, 1, 1), 'apple', 1.123, 10],\n        [None, 'orange', 99.999, 20], # Changed data\n    ],\n    schema=StructType([\n        StructField('col_a', DateType(), True),\n        StructField('col_b', StringType(), True),\n        StructField('col_c', DoubleType(), True),\n        StructField('col_d', LongType(), True),\n    ]),\n)\n\nprint(\"\\nAsserting different DataFrames (expected to fail)...\")\ntry:\n    assert_pyspark_df_equal(df_1, df_3, check_dtype=True, check_column_names=True, check_columns_in_order=True, order_by=['col_a', 'col_b'])\nexcept AssertionError as e:\n    print(f\"Caught expected error: {e}\")\n\n# Stop SparkSession\nspark.stop()\n","lang":"python","description":"This quickstart demonstrates how to use `assert_pyspark_df_equal` to compare two PySpark DataFrames. It includes the necessary setup for a local SparkSession and shows both successful and intentionally failing assertions to illustrate its usage and error reporting. The `check_dtype`, `check_column_names`, `check_columns_in_order`, and `order_by` parameters are used for a strict comparison."},"warnings":[{"fix":"Set `check_column_names=True`, `check_columns_in_order=True`, and `check_dtype=True` for strict comparisons. 
Pass `order_by` when row order is not guaranteed but the contents should match, so that both DataFrames are sorted before comparison.","message":"By default, `assert_pyspark_df_equal` does not check for column name equality (`check_column_names=False`) or column order (`check_columns_in_order=False`). As a result, the assertion can pass for DataFrames that hold identical data but differ in column names or column order. Always set the comparison strictness explicitly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For `pytest`, use `session`-scoped fixtures to create a single SparkSession for all tests. For `unittest`, use `setUpClass` and `tearDownClass`. Consider raising the `py4j` log level to `WARN` or `ERROR` in your test configuration.","message":"Managing SparkSession setup and teardown in a test suite can be complex, leading to resource leaks or slow tests if not handled properly. Excessive logging from `py4j` (Spark's Java gateway) can also obscure relevant test output.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Examine the detailed error output provided by `assert_pyspark_df_equal`, which highlights differing rows and columns. Verify your transformation logic or expected input/output data. Ensure `order_by` is set if row order is non-deterministic.","cause":"The actual data within the DataFrames differs, either in individual cell values or in the presence or absence of rows.","error":"AssertionError: DataFrames are not equal. ..."},{"fix":"Review the schema definition of both DataFrames. Ensure that all column names, their exact data types, and nullability properties are identical. 
Set `check_dtype=False` if you only care about data values and not type strictness.","cause":"The schemas (column names, data types, nullability) of the compared DataFrames do not match, and `check_dtype=True` was used.","error":"AssertionError: Schema are not equal. ..."},{"fix":"Ensure that both DataFrames have the exact same column names in the exact same order. If column order is not important, set `check_columns_in_order=False`.","cause":"With `check_column_names=True`, the DataFrames either have different column names or, when `check_columns_in_order=True` is also set, have the same columns in a different order.","error":"AssertionError: Column names are not equal. ..."}]}