{"id":2770,"library":"soda-core-spark-df","title":"Soda Core Spark DataFrame Connector","description":"Soda Core Spark DF is a Soda Core package that enables Soda Core to connect to Spark DataFrames as a data source. It allows users to define and run data quality checks directly on Spark DataFrames, making it suitable for data pipelines that operate within a Spark environment. Current version is 3.5.6. Releases follow Soda Core's cadence, typically monthly or bi-monthly for minor versions.","status":"active","version":"3.5.6","language":"en","source_language":"en","source_url":"https://github.com/sodadata/soda-core-spark-df","tags":["data quality","spark","dataframe","etl","data testing","soda"],"install":[{"cmd":"pip install soda-core-spark-df pyspark","lang":"bash","label":"Install with PySpark"}],"dependencies":[{"reason":"This package extends Soda Core's functionality; Soda Core is the base library required for defining and executing scans.","package":"soda-core","optional":false},{"reason":"Required for creating and manipulating Spark DataFrames, which are the primary data source for this connector.","package":"pyspark","optional":false}],"imports":[{"symbol":"Scan","correct":"from soda.scan import Scan"},{"symbol":"SparkDfDataSource","correct":"from soda.spark_df_data_source import SparkDfDataSource"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom soda.scan import Scan\nfrom soda.spark_df_data_source import SparkDfDataSource\n\n# 1. Prepare your Spark DataFrame\nspark = SparkSession.builder.appName(\"SodaSparkTest\").getOrCreate()\ndata = [(\"Alice\", 1), (\"Bob\", 2), (\"Charlie\", None)]\ncolumns = [\"name\", \"id\"]\ndf = spark.createDataFrame(data, columns)\n\n# 2. Configure Soda Core and register the Spark DataFrame\nscan = Scan()\nscan.add_configuration_yaml_str(\n    f\"\"\"\ndata_source spark_df_source:\n  type: spark_df\n\"\"\"\n)\nscan.add_spark_session(spark) # Pass the active SparkSession\n\n# 3. 
Add the DataFrame to the scan as a data source\nspark_df_data_source = SparkDfDataSource(spark=spark, data_frame=df, data_source_name=\"spark_df_source\", table_name=\"my_spark_df_table\")\nscan.add_data_source(spark_df_data_source)\n\n# 4. Define your checks (e.g., in a checks.yaml or programmatically)\nscan.add_sodacl_yaml_str(\n    \"\"\"\nchecks_for my_spark_df_table:\n  - row_count > 0\n  - missing_count(id) = 1\n  - column_count = 2\n\"\"\"\n)\n\n# 5. Execute the scan\nscan.execute()\n\n# 6. Print scan results\nprint(scan.get_logs_text())\nif scan.has_failures():\n    print(\"Scan finished with failures.\")\n    exit(1)\nelif scan.has_warnings():\n    print(\"Scan finished with warnings.\")\n    exit(0)\nelse:\n    print(\"Scan finished successfully.\")\n","lang":"python","description":"This quickstart demonstrates how to set up a Spark Session, create a DataFrame, initialize a Soda `Scan` object, register the DataFrame as a data source using `SparkDfDataSource`, define basic data quality checks, and execute the scan to get results."},"warnings":[{"fix":"Ensure `pyspark` is installed and compatible with your Spark cluster/local setup (`pip install pyspark`). 
If running on a cluster, verify that the Spark driver and executors have access to `soda-core-spark-df` and `soda-core`.","message":"PySpark Installation and Environment: Users often hit `ModuleNotFoundError` or Spark-context initialization errors because `pyspark` is not installed or its version does not match the underlying Spark environment.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify that the temporary view name created with `df.createOrReplaceTempView()` is identical to the table name in the SodaCL `checks for` block, and that the data source name passed to `scan.set_data_source_name()` is consistent with the one used by `scan.add_spark_session()`.","message":"Name Mismatch: Soda resolves the DataFrame through the name of its temporary view. If the view name does not match the table name in the `checks for` block, or the data source names are inconsistent, the checks are not executed.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always call `scan.add_spark_session(spark)` after creating the `Scan` object and before `scan.execute()`, so the scan has access to the Spark context.","message":"Forgetting to add the Spark Session to the Scan: Users sometimes define their checks but forget to link the active `SparkSession` to the `Scan` object with `scan.add_spark_session(spark)`. The scan then fails at runtime because the underlying Spark context is not available to Soda.","severity":"gotcha","affected_versions":"All `soda-core` 3.x versions"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}