{"library":"soda-core-spark","title":"Soda Core Spark Integration (Legacy)","type":"library","description":"This entry describes `soda-core-spark`, an older Python library for data quality testing on Spark DataFrames. It was an extension of `Soda SQL` that allowed programmatic data quality checks. As of Soda v3, `soda-core-spark` and `soda-sql` have been deprecated. Spark DataFrame integration is now handled directly by the main `soda-core` library using its native Spark connection capabilities. The latest available version of this deprecated package is `3.5.6`.","language":"python","status":"deprecated","last_verified":"Sat May 16","install":{"commands":["pip install soda-core-spark"],"cli":{"name":"soda","version":"soda-core, version 3.5.6"}},"imports":["from sodaspark import scan"],"auth":{"required":false,"env_vars":[]},"links":{"homepage":"https://www.soda.io","github":null,"docs":null,"changelog":null,"pypi":"https://pypi.org/project/soda-core-spark/","npm":null,"openapi_spec":null,"status_page":null,"smithery":null},"quickstart":{"code":"import os\nfrom pyspark.sql import SparkSession\nfrom sodaspark import scan\n\n# Initialize Spark Session\nspark_session = SparkSession.builder.appName(\"SodaSparkExample\").getOrCreate()\n\n# Create a sample DataFrame\ndf = spark_session.createDataFrame([\n    {\"id\": \"1\", \"name\": \"Alice\", \"age\": 30},\n    {\"id\": \"2\", \"name\": \"Bob\", \"age\": None},\n    {\"id\": \"3\", \"name\": \"Charlie\", \"age\": 35},\n    {\"id\": \"4\", \"name\": \"David\", \"age\": 22}\n])\n\n# Define data quality checks in YAML format\n# For deprecated soda-spark, checks are passed as a string.\n# For modern Soda Core, these would typically be in a separate .yml file.\nscan_definition = \"\"\"\ntable_name: my_dataframe\nmetrics:\n  - row_count\n  - missing_count(age)\n  - avg(age)\nchecks:\n  - row_count > 0\n  - missing_count(age) < 1\n  - avg(age) between 20 and 40\n\"\"\"\n\n# Execute the scan\n# Note: data_source_name should be 
set if connecting to Soda Cloud,\n# but for local programmatic scans, it's often 'spark_df' by default.\nscan_results = scan.execute(\n    data_frame=df,\n    scan_definition=scan_definition,\n    data_source_name=\"spark_df\"  # Can be customized\n)\n\nprint(\"Scan Results:\")\nprint(scan_results.get_json_representation())\n\n# Stop Spark Session\nspark_session.stop()\n\n# IMPORTANT: This quickstart uses the deprecated `sodaspark` library.\n# For current Spark integration, please refer to Soda Core documentation and use\n# `from soda.scan import Scan` and `scan.add_spark_session(...)`.\n","lang":"python","description":"This example demonstrates how to perform data quality checks on a Spark DataFrame using the deprecated `soda-core-spark` library (`sodaspark`). It initializes a Spark session, creates a sample DataFrame, defines data quality checks in a YAML string, and executes the scan programmatically. For current usage, migrate to `soda-core`.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-16","installed_version":"3.5.6","pypi_latest":"3.5.6","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":100,"avg_install_s":6.3,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"49.4M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":7,"import_time_s":null,"mem_mb":null,"disk_size":"50M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"54.9M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":6.2,"import_time_s":null,"mem_mb":null,"disk_size":"56M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"46.1M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":5.1,"import_time_s":null,"mem_mb":null,"disk_size":"47M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"44.1M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":5.1,"import_time_s":null,"mem_mb":null,"disk_size":"44M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"48.5M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"soda-core-spark","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":8.1,"import_time_s":null,"mem_mb":null,"disk_size":"49M"}]}}