Soda Core Spark Integration (Legacy)

3.5.6 · deprecated · verified Fri Apr 10

This entry describes `soda-core-spark`, an older Python library for data quality testing on Spark DataFrames. It was an extension of `Soda SQL` that allowed programmatic data quality checks. As of Soda v3, `soda-core-spark` and `soda-sql` have been deprecated. Spark DataFrame integration is now handled directly by the main `soda-core` library using its native Spark connection capabilities. The latest available version of this deprecated package is `3.5.6`.

Warnings

Install

Imports

Quickstart

This example runs data quality checks on a Spark DataFrame with the deprecated `soda-core-spark` library (imported as `sodaspark`). It initializes a Spark session, creates a sample DataFrame, defines the checks in a YAML string, and executes the scan programmatically. For new projects, use `soda-core` instead.

from pyspark.sql import SparkSession
from sodaspark import scan

# Initialize Spark Session
spark_session = SparkSession.builder.appName("SodaSparkExample").getOrCreate()

# Create a sample DataFrame
df = spark_session.createDataFrame([
    {"id": "1", "name": "Alice", "age": 30},
    {"id": "2", "name": "Bob", "age": None},
    {"id": "3", "name": "Charlie", "age": 35},
    {"id": "4", "name": "David", "age": 22}
])

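# Define the scan in the legacy Soda SQL scan YAML format, passed as a string.
# In that format, table-level tests go under `tests` and column-level tests
# under `columns`; each test is an expression over metric names.
# (The `checks` / `missing_count(age)` syntax belongs to SodaCL in modern Soda Core.)
# With the sample data, `missing_count < 1` fails because Bob's age is null.
scan_definition = """
table_name: my_dataframe
metrics:
  - row_count
  - missing_count
  - avg
tests:
  - row_count > 0
columns:
  age:
    tests:
      - missing_count < 1
      - 20 <= avg <= 40
"""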

# Execute the scan.
# As documented in the soda-spark README, the scan definition is passed first
# and the DataFrame second; there is no `data_frame` keyword argument.
scan_results = scan.execute(scan_definition, df)

print("Scan Results:")
# Computed metric values, followed by the outcome of each test
print(scan_results.measurements)
print(scan_results.test_results)

# Stop Spark Session
spark_session.stop()

# IMPORTANT: This quickstart uses the deprecated `sodaspark` library.
# For current Spark integration, please refer to Soda Core documentation and use
# `from soda.scan import Scan` and `scan.add_spark_session(...)`.
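
For comparison, here is a minimal sketch of how the same three checks might look with the current Soda Core API mentioned in the note above. It assumes Soda Core is installed with Spark DataFrame support (to my understanding, the `soda-core-spark-df` package) alongside `pyspark`. The helper names `build_checks`, `run_scan`, and `main`, the scan definition name, and the `spark_df` data source name are illustrative choices, not part of any library. The DataFrame is registered as a temporary view so that the SodaCL checks can refer to it by name.

```python
def build_checks(table_name: str) -> str:
    """Return SodaCL YAML with three checks on the given table (illustrative helper)."""
    return (
        f"checks for {table_name}:\n"
        "  - row_count > 0\n"
        "  - missing_count(age) < 1\n"
        "  - avg(age) between 20 and 40\n"
    )


def run_scan(spark_session, df, table_name="my_dataframe", data_source_name="spark_df"):
    """Run the checks against a DataFrame and return the executed Scan object."""
    # Imported lazily so build_checks() works without Soda installed.
    from soda.scan import Scan

    # Soda queries Spark by name, so expose the DataFrame as a temp view.
    df.createOrReplaceTempView(table_name)

    scan = Scan()
    scan.set_scan_definition_name("legacy_quickstart_migration")  # assumed name
    scan.set_data_source_name(data_source_name)
    scan.add_spark_session(spark_session, data_source_name=data_source_name)
    scan.add_sodacl_yaml_str(build_checks(table_name))
    scan.execute()
    return scan


def main():
    """Recreate the quickstart's sample data and print the scan logs."""
    from pyspark.sql import SparkSession

    spark_session = SparkSession.builder.appName("SodaCoreExample").getOrCreate()
    df = spark_session.createDataFrame([
        {"id": "1", "name": "Alice", "age": 30},
        {"id": "2", "name": "Bob", "age": None},
        {"id": "3", "name": "Charlie", "age": 35},
        {"id": "4", "name": "David", "age": 22},
    ])
    scan = run_scan(spark_session, df)
    print(scan.get_logs_text())
    print("Any failed checks:", scan.has_check_fails())
    spark_session.stop()
```

Calling `main()` in an environment with both packages installed runs the scan end to end. Keeping the check text in `build_checks` separates the SodaCL definition from the Spark wiring, so the same string could also be saved to a `.yml` file and loaded with the scan's file-based methods.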
