Soda Core Spark Integration (Legacy)
This entry describes `soda-core-spark`, an older Python library (imported as `sodaspark`) for data quality testing on Spark DataFrames. It was an extension of `Soda SQL` that allowed programmatic data quality checks. As of Soda v3, `soda-core-spark` and `soda-sql` have been deprecated. Spark DataFrame integration is now handled directly by the main `soda-core` library using its native Spark connection capabilities. The latest available version of this deprecated package is `3.5.6`.
Warnings
- breaking The `soda-core-spark` package has been officially deprecated. It, along with `Soda SQL`, has been replaced by `Soda Core` as the unified solution for data quality testing.
- breaking Soda Core v4 (released January 28, 2026) introduces 'Data Contracts' as the default method for defining data quality rules, replacing the previous 'checks language' syntax. This is a significant breaking change for users migrating from older versions of Soda Core or `soda-core-spark`.
- gotcha Soda Core v3 (which is the relevant version for migrating from `soda-core-spark`) has known compatibility limitations. Specifically, it does not support Apache Spark 4.0 or Python 3.12.
- gotcha When using Soda Core with Spark DataFrames, you typically need to run Soda programmatically and register DataFrames as temporary views for checks to be executed.
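For context, the v3 "checks language" (SodaCL) that the warnings above refer to looks like the following sketch. The dataset name (`my_dataframe` here, an illustrative name) must match a temporary view registered from the DataFrame:

```yaml
# SodaCL (Soda Core v3 checks language) — replaced by Data Contracts in v4.
# `my_dataframe` is the name of a temporary view registered from the DataFrame.
checks for my_dataframe:
  - row_count > 0
  - missing_count(age) = 0
  - avg(age) between 20 and 40
```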
Install
-
pip install soda-spark  # legacy Soda SQL-era package; provides the `sodaspark` module used below
Imports
- scan
from sodaspark import scan
Quickstart
import os
from pyspark.sql import SparkSession
from sodaspark import scan
# Initialize Spark Session
spark_session = SparkSession.builder.appName("SodaSparkExample").getOrCreate()
# Create a sample DataFrame
df = spark_session.createDataFrame([
    {"id": "1", "name": "Alice", "age": 30},
    {"id": "2", "name": "Bob", "age": None},
    {"id": "3", "name": "Charlie", "age": 35},
    {"id": "4", "name": "David", "age": 22},
])
# Define the scan definition in Soda SQL's YAML format.
# The legacy `sodaspark` library accepts it as a string (or a file path).
# Table-level tests live under `tests:`; column tests live under `columns:`.
# For modern Soda Core, checks would typically be in a separate .yml file.
scan_definition = """
table_name: my_dataframe
metrics:
  - row_count
  - missing_count
  - avg
tests:
  - row_count > 0
columns:
  age:
    tests:
      - missing_count < 1
      - avg > 20
      - avg < 40
"""
# Execute the scan. The legacy `sodaspark` API takes the scan definition
# first and the DataFrame second; there is no `data_source_name` argument.
scan_results = scan.execute(scan_definition, df)
print("Measurements:", scan_results.measurements)
print("Test results:", scan_results.test_results)
# Stop Spark Session
spark_session.stop()
# IMPORTANT: This quickstart uses the deprecated `sodaspark` library.
# For current Spark integration, please refer to Soda Core documentation and use
# `from soda.scan import Scan` and `scan.add_spark_session(...)`.