Soda Core Spark DataFrame Connector


Soda Core Spark DF is a Soda Core package that connects Soda Core to Spark DataFrames as a data source. It lets users define and run data quality checks directly on Spark DataFrames, making it well suited to data pipelines that run inside a Spark environment. The current version is 3.5.6. Releases follow Soda Core's cadence, typically monthly or bi-monthly for minor versions.

Install

pip install soda-core-spark-df

Imports

from pyspark.sql import SparkSession
from soda.scan import Scan

Quickstart

This quickstart demonstrates how to set up a Spark session, create a DataFrame, register it as a temporary view so Soda can query it by name, initialize a Soda `Scan` object, attach the Spark session as a `spark_df` data source, define basic data quality checks in SodaCL, and execute the scan to get results.

import sys

from pyspark.sql import SparkSession
from soda.scan import Scan

# 1. Prepare your Spark DataFrame
spark = SparkSession.builder.appName("SodaSparkTest").getOrCreate()
data = [("Alice", 1), ("Bob", 2), ("Charlie", None)]
columns = ["name", "id"]
df = spark.createDataFrame(data, columns)

# 2. Register the DataFrame as a temporary view so Soda can query it by table name
df.createOrReplaceTempView("my_spark_df_table")

# 3. Configure the Soda scan and attach the active SparkSession
scan = Scan()
scan.set_scan_definition_name("spark_df_quickstart")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")

# 4. Define your checks (e.g., in a checks.yaml or programmatically)
scan.add_sodacl_yaml_str(
    """
checks for my_spark_df_table:
  - row_count > 0
  - missing_count(id) = 1
  - schema:
      fail:
        when required column missing: [name, id]
"""
)

# 5. Execute the scan
scan.execute()

# 6. Print scan results
print(scan.get_logs_text())
if scan.has_check_fails():
    print("Scan finished with failures.")
    sys.exit(1)
elif scan.has_check_warns():
    print("Scan finished with warnings.")
    sys.exit(0)
else:
    print("Scan finished successfully.")
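If you need the outcomes programmatically rather than as log text, `scan.get_scan_results()` returns a plain dict. Below is a minimal sketch of summarizing such a result, assuming the dict carries a `checks` list whose entries include an `outcome` field; the `summarize_checks` helper and the `sample` dict are hypothetical illustrations, not part of the Soda API.

```python
from collections import Counter


def summarize_checks(scan_results: dict) -> dict:
    """Count check outcomes (e.g. pass/warn/fail) in a scan-results-shaped dict."""
    outcomes = Counter(check.get("outcome") for check in scan_results.get("checks", []))
    return dict(outcomes)


# Hand-written sample mimicking the assumed shape of get_scan_results();
# not produced by a real scan.
sample = {
    "checks": [
        {"name": "row_count > 0", "outcome": "pass"},
        {"name": "missing_count(id) = 1", "outcome": "pass"},
        {"name": "duplicate_count(name) = 0", "outcome": "fail"},
    ]
}

print(summarize_checks(sample))  # {'pass': 2, 'fail': 1}
```

A summary like this is convenient for emitting pipeline metrics or deciding whether a downstream task should run, without parsing log text.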
