Soda Core

4.3.0 · active · verified Thu Apr 09

Soda Core is an open-source command-line tool and Python library for data quality testing, monitoring, and scanning. Users define data quality checks (e.g., freshness, uniqueness, validity) and run them against various data sources to identify issues. The current version is 4.3.0, which requires Python >=3.10. The project is updated frequently and has a strong focus on community contributions.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to run a programmatic Soda Core scan against a local CSV file. It defines the data source and the SodaCL checks as strings, executes the scan, and prints a summary of the results. The example uses `pandas` to create the CSV file; install it with `pip install pandas` if it is not already available.

import os
import pandas as pd
from soda.scan import Scan

# Create a dummy CSV file for the example
csv_filename = "sample_data.csv"
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': ['A', 'B', 'C', 'D', 'E'],
    'status': ['active', 'inactive', 'active', 'active', None]
})
df.to_csv(csv_filename, index=False)

# Define the data source configuration as a string
data_source_config = f"""
  data_source my_csv_source:
    type: local_system
    file_system:
      type: local
    path: {os.getcwd()}
"""

# Define SodaCL checks as a string
sodacl_checks = f"""
  checks for {csv_filename}:
    - row_count > 0
    - missing_count(status) = 1
    - duplicate_count(id) = 0
"""

# Run the Soda Core scan programmatically
scan = Scan()
scan.set_verbose(True) # Optional: for more detailed output
scan.add_configuration_yaml_str(data_source_config)
scan.add_sodacl_yaml_str(sodacl_checks)
scan.set_data_source_name("my_csv_source") # Must match the name in data_source_config
scan.execute()

print("\n--- Scan Results ---")
if scan.has_check_fails():
    print("Scan completed with failures.")
else:
    print("Scan completed successfully.")

# Clean up the dummy CSV file
os.remove(csv_filename)
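
To make the three checks in the quickstart concrete, the following is a minimal sketch in plain Python of what `row_count`, `missing_count(status)`, and `duplicate_count(id)` measure on the same sample data. It does not use Soda Core and is not Soda's implementation; the helper name `sketch_metrics` is hypothetical, and counting duplicates as "distinct values that occur more than once" is an assumption about the metric's definition.

```python
from collections import Counter

# Same sample rows as the quickstart's CSV file (None marks a missing value).
rows = [
    {"id": 1, "value": "A", "status": "active"},
    {"id": 2, "value": "B", "status": "inactive"},
    {"id": 3, "value": "C", "status": "active"},
    {"id": 4, "value": "D", "status": "active"},
    {"id": 5, "value": "E", "status": None},
]


def sketch_metrics(rows):
    """Hypothetical helper: compute the three quickstart metrics by hand."""
    # How often each id value appears in the dataset
    id_frequencies = Counter(row["id"] for row in rows)
    return {
        # Total number of rows
        "row_count": len(rows),
        # Rows whose status is missing
        "missing_count(status)": sum(1 for row in rows if row["status"] is None),
        # Assumption: distinct id values that occur more than once
        "duplicate_count(id)": sum(1 for n in id_frequencies.values() if n > 1),
    }


metrics = sketch_metrics(rows)
print(metrics)
```

With this data, all three quickstart checks (`row_count > 0`, `missing_count(status) = 1`, `duplicate_count(id) = 0`) would be expected to pass.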
