{"id":6831,"library":"pytest-spark","title":"pytest-spark","description":"pytest-spark is a pytest plugin that simplifies testing PySpark applications by automatically providing session-scoped `spark_context` and `spark_session` fixtures. It enables users to configure the Spark environment, including setting SPARK_HOME and custom `spark_options`, directly within `pytest.ini`. The current version is 0.8.0.","status":"active","version":"0.8.0","language":"en","source_language":"en","source_url":"https://github.com/malexer/pytest-spark","tags":["pytest","spark","pyspark","testing","fixtures"],"install":[{"cmd":"pip install pytest-spark","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Testing framework the plugin integrates with.","package":"pytest"},{"reason":"Core Apache Spark Python API it facilitates testing for.","package":"pyspark"},{"reason":"Required for Spark Connect functionality (Spark 3.4+), if used.","package":"pyspark[connect]","optional":true},{"reason":"Alternative package for Spark Connect (PySpark 4.x), if used.","package":"pyspark-connect","optional":true}],"imports":[{"note":"pytest-spark provides fixtures that pytest auto-discovers; no direct import from 'pytest_spark' itself is typically needed in test files.","symbol":"spark_session","correct":"def test_example(spark_session):"},{"note":"pytest-spark provides fixtures that pytest auto-discovers; no direct import from 'pytest_spark' itself is typically needed in test files.","symbol":"spark_context","correct":"def test_example(spark_context):"}],"quickstart":{"code":"# pytest.ini (optional: tune the SparkSession the plugin creates)\n# [pytest]\n# spark_options =\n#     spark.sql.catalogImplementation: in-memory\n\n# test_example.py -- no conftest.py boilerplate is needed; pytest-spark\n# supplies the session-scoped `spark_session` fixture automatically.\nfrom pyspark.sql.types import StructType, StructField, StringType, IntegerType\n\ndef test_data_frame_creation(spark_session):\n    schema = StructType([\n        StructField(\"name\", StringType(), True),\n        StructField(\"age\", IntegerType(), True)\n    ])\n    data = [(\"Alice\", 1), (\"Bob\", 2)]\n    df = spark_session.createDataFrame(data, schema)\n\n    assert df.count() == 2\n    assert df.columns == [\"name\", \"age\"]\n    assert df.collect()[0].name == \"Alice\"","lang":"python","description":"No fixture definition is required: once pytest-spark is installed, pytest auto-discovers its session-scoped `spark_session` and `spark_context` fixtures. Optionally tune the session via `spark_options` in `pytest.ini`, then write tests that accept `spark_session` as an argument and run them with `pytest` from your terminal."},"warnings":[{"fix":"Prefer installing `pyspark` via `pip` and omit `SPARK_HOME` if possible. If explicit `SPARK_HOME` is needed, use `pytest.ini` for project-level consistency or `--spark_home` for specific runs, understanding their precedence.","message":"Configuring `SPARK_HOME` has multiple methods (environment variable, `pytest.ini`, `--spark_home` CLI option) which are read in a specific order (CLI > `pytest.ini` > ENV). If `pyspark` is installed via pip, setting `SPARK_HOME` might not be necessary, leading to confusion or unexpected behavior if mismatched.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use the `spark_session` fixture for tests involving Spark Connect.","message":"The `spark_context` fixture is not supported when using Spark Connect functionality. 
If you're working with Spark 3.4+ and Spark Connect, you must use the `spark_session` fixture instead.","severity":"breaking","affected_versions":"0.6.0 onwards (with Spark 3.4+)"},{"fix":"Add `spark_options = spark.sql.catalogImplementation: in-memory` under the `[pytest]` section in your `pytest.ini` to explicitly disable Hive support.","message":"By default, the `spark_session` fixture creates a SparkSession with Hive support enabled. If Hive jars are not desired or cause conflicts, you can explicitly disable Hive support by adding `spark_options = spark.sql.catalogImplementation: in-memory` to your `pytest.ini`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For robust DataFrame comparison, convert both expected and actual DataFrames to Pandas DataFrames, sort them by common keys (if order doesn't matter), and then use `pandas.testing.assert_frame_equal(df1.toPandas().sort_values(...), df2.toPandas().sort_values(...), check_like=True)` to ignore column order. Libraries like `chispa` also provide Spark DataFrame equality assertions.","message":"Comparing Spark DataFrames for equality can be challenging directly due to potential differences in row order, column order, or schema. Direct `==` comparison often fails even for logically identical DataFrames.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Utilize `pytest`'s `tmp_path_factory` fixture to generate a unique temporary directory for each `pytest-xdist` worker process, and configure Spark to use these isolated directories for its local storage (e.g., `spark.local.dir`).","message":"When running tests in parallel with `pytest-xdist`, session-scoped Spark fixtures (like `spark_session`) can interfere with each other if not properly isolated. 
Each parallel process might try to use the same temporary directories or resources, leading to data races or failures.","severity":"gotcha","affected_versions":"All versions when using `pytest-xdist`"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}
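A note on the DataFrame-equality warning above: the recommended pattern (convert to pandas, sort by key columns, compare with `check_like=True`) can be sketched with plain pandas, no running Spark required. The helper name `assert_df_equal_unordered` is illustrative, not part of pytest-spark; in a real Spark test, `left` and `right` would typically come from `df.toPandas()` on the actual and expected DataFrames.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_df_equal_unordered(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> None:
    """Compare two DataFrames, ignoring row order and column order.

    Illustrative helper (not part of pytest-spark): in a Spark test,
    `left` and `right` would come from `spark_df.toPandas()`.
    """
    # Sort by the key columns so row order no longer matters, then drop the
    # leftover index so positional comparison lines up.
    left_sorted = left.sort_values(keys).reset_index(drop=True)
    right_sorted = right.sort_values(keys).reset_index(drop=True)
    # check_like=True ignores column ordering (labels and dtypes still checked).
    assert_frame_equal(left_sorted, right_sorted, check_like=True)

# Logically identical frames with different row and column order:
a = pd.DataFrame({"name": ["Alice", "Bob"], "age": [1, 2]})
b = pd.DataFrame({"age": [2, 1], "name": ["Bob", "Alice"]})
assert_df_equal_unordered(a, b, keys=["name"])  # passes
```

Sorting plus `reset_index(drop=True)` normalizes row order, while `check_like=True` makes `assert_frame_equal` tolerate column reordering without weakening the label and dtype checks.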