{"id":5725,"library":"soda-core-spark","title":"Soda Core Spark Integration (Legacy)","description":"This entry describes `soda-core-spark`, an older Python library for data quality testing on Spark DataFrames. It was an extension of `Soda SQL` that allowed programmatic data quality checks. As of Soda v3, `soda-core-spark` and `soda-sql` have been deprecated. Spark DataFrame integration is now handled directly by the main `soda-core` library using its native Spark connection capabilities. The latest available version of this deprecated package is `3.5.6`.","status":"deprecated","version":"3.5.6","language":"en","source_language":"en","source_url":"https://github.com/sodadata/soda-spark","tags":["data quality","spark","deprecated","etl","data observability"],"install":[{"cmd":"pip install soda-core-spark","lang":"bash","label":"Install deprecated package"}],"dependencies":[{"reason":"Required for interacting with Apache Spark DataFrames.","package":"pyspark","optional":false}],"imports":[{"note":"The `sodaspark` module was part of the deprecated `soda-core-spark` package and provided programmatic scans. For `soda-core` (its replacement), use `from soda.scan import Scan`.","wrong":"from soda.scan import Scan","symbol":"scan","correct":"from sodaspark import scan"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom sodaspark import scan\n\n# Initialize a Spark session\nspark_session = SparkSession.builder.appName(\"SodaSparkExample\").getOrCreate()\n\n# Create a sample DataFrame\ndf = spark_session.createDataFrame([\n    {\"id\": \"1\", \"name\": \"Alice\", \"age\": 30},\n    {\"id\": \"2\", \"name\": \"Bob\", \"age\": None},\n    {\"id\": \"3\", \"name\": \"Charlie\", \"age\": 35},\n    {\"id\": \"4\", \"name\": \"David\", \"age\": 22}\n])\n\n# Define the scan in Soda SQL YAML: table-level metrics and tests, plus\n# per-column metrics and tests. For the deprecated soda-spark, the scan\n# definition is passed as a string; modern Soda Core uses SodaCL checks,\n# typically in a separate .yml file.\nscan_definition = \"\"\"\ntable_name: my_dataframe\nmetrics:\n  - row_count\ntests:\n  - row_count > 0\ncolumns:\n  age:\n    metrics:\n      - missing_count\n      - avg\n    tests:\n      - missing_count < 1\n      - avg > 20\n      - avg < 40\n\"\"\"\n\n# Execute the scan against the DataFrame\nscan_results = scan.execute(scan_definition, df)\n\nprint(\"Measurements:\")\nprint(scan_results.measurements)\nprint(\"Test results:\")\nprint(scan_results.test_results)\n\n# Stop the Spark session\nspark_session.stop()\n\n# IMPORTANT: This quickstart uses the deprecated `sodaspark` library.\n# For current Spark integration, refer to the Soda Core documentation and use\n# `from soda.scan import Scan` with `scan.add_spark_session(...)`.\n","lang":"python","description":"This example demonstrates how to perform data quality checks on a Spark DataFrame using the deprecated `soda-core-spark` library (`sodaspark`). It initializes a Spark session, creates a sample DataFrame, defines data quality metrics and tests in a YAML string, and executes the scan programmatically. Please note that for modern usage, you should migrate to `soda-core`."},"warnings":[{"fix":"Migrate to `soda-core` and use its native Apache Spark connection capabilities. This involves installing `soda-core` and configuring your `Scan` object with a Spark session.","message":"The `soda-core-spark` package has been officially deprecated. It, along with `Soda SQL`, has been replaced by `Soda Core` as the unified solution for data quality testing.","severity":"breaking","affected_versions":"<=3.5.6"},{"fix":"Review the Soda Core v4 documentation for migration guidance and examples of the new Data Contract format. Pin your `soda-core` dependency to a v3 version (e.g., `soda-core==3.5.6`) if you are not ready to upgrade to v4.","message":"Soda Core v4 (released January 28, 2026) introduces 'Data Contracts' as the default method for defining data quality rules, replacing the previous 'checks language' syntax. This is a significant breaking change for users migrating from older versions of Soda Core or `soda-core-spark`.","severity":"breaking","affected_versions":"4.x.x onwards (for `soda-core`)"},{"fix":"Ensure your environment uses compatible versions of Spark (e.g., Spark 3.x) and Python (e.g., Python 3.11 or lower) when working with `soda-core` v3. Check the official Soda Core documentation for the latest compatibility matrix.","message":"Soda Core v3 (the relevant version when migrating from `soda-core-spark`) has known compatibility limitations. Specifically, it does not support Apache Spark 4.0 or Python 3.12.","severity":"gotcha","affected_versions":"3.x.x (for `soda-core`)"},{"fix":"After creating or loading your Spark DataFrame, use `df.createOrReplaceTempView(\"your_temp_view_name\")` to make it accessible to Soda scans. Then, define your checks in YAML (or Data Contracts in v4) against this temporary view name.","message":"When using Soda Core with Spark DataFrames, you typically need to run Soda programmatically and register DataFrames as temporary views for checks to be executed.","severity":"gotcha","affected_versions":"All versions (for `soda-core` with Spark)"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}