Soda Core
Soda Core is an open-source command-line tool and Python library for data quality testing, monitoring, and scanning. It lets users define data quality checks (e.g., freshness, uniqueness, validity) and execute them against various data sources to identify issues. The current version is 4.3.0, which requires Python >=3.10. The project is updated frequently and has a strong focus on community contributions.
Warnings
- breaking Major API changes occurred in Soda Core 4.0.0. The `Scan` object's methods, configuration file naming, and SodaCL syntax were revised. For example, `scan.set_scan_definition_name()` and `scan.add_sodacl_yaml_file()` from 3.x were replaced by methods like `scan.set_data_source_name()` and `scan.add_check_yaml_file()` or `add_sodacl_yaml_str()`.
- gotcha Soda Core itself does not include database drivers. You must install specific `soda-core-<data-source>` packages (e.g., `soda-core-postgres`, `soda-core-bigquery`, `soda-core-snowflake`) separately for the data sources you intend to scan. Failure to do so will result in connection errors.
- gotcha When you run `soda scan` from the CLI without explicitly specifying configuration files (e.g., `-d data_source.yml -c checks.yml`), Soda Core automatically looks for `data_source.yml` and `checks.yml` (or `configuration.yml` in older versions) in the current working directory. If that directory contains files with those names from another project, the scan can silently run against the wrong data source or checks, so pass the file paths explicitly.
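To guard against the implicit-lookup gotcha above, a script can list which default-named files are present before a scan is started. This is a minimal sketch using only the standard library; the helper `find_implicit_config_files` is hypothetical and not part of Soda Core, and the file names are the ones mentioned in the warning.

```python
from pathlib import Path

# File names taken from the warning above; the helper itself is hypothetical
# and is not part of Soda Core.
DEFAULT_CONFIG_NAMES = ("data_source.yml", "checks.yml", "configuration.yml")


def find_implicit_config_files(directory="."):
    """Return the default-named config files that exist in `directory`."""
    base = Path(directory)
    return [name for name in DEFAULT_CONFIG_NAMES if (base / name).is_file()]


if __name__ == "__main__":
    found = find_implicit_config_files()
    if found:
        print("Files that could be picked up implicitly:", ", ".join(found))
    else:
        print("No default-named configuration files in this directory.")
```

Running this in the directory where `soda scan` will be invoked shows whether any file could be picked up without being named on the command line.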
Install
pip install soda-core
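Because connector packages are installed separately, it can help to confirm that the ones you need are present before running a scan. The sketch below uses the standard library's `importlib.metadata`; the helper `installed_version` is hypothetical, and `soda-core-postgres` is only an example drawn from the package names listed in the warnings.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of `package_name`, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names follow the soda-core-<data-source> pattern; adjust the tuple
# to the connectors your scans actually need.
for package in ("soda-core", "soda-core-postgres"):
    found = installed_version(package)
    print(f"{package}: {found or 'not installed'}")
```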
Imports
- Scan
from soda.scan import Scan
Quickstart
import os
import pandas as pd
from soda.scan import Scan
# Create a dummy CSV file for the example
csv_filename = "sample_data.csv"
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': ['A', 'B', 'C', 'D', 'E'],
    'status': ['active', 'inactive', 'active', 'active', None]
})
df.to_csv(csv_filename, index=False)
# Define the data source configuration as a string
data_source_config = f"""
data_source my_csv_source:
  type: local_system
  file_system:
    type: local
    path: {os.getcwd()}
"""
# Define SodaCL checks as a string
sodacl_checks = f"""
checks for {csv_filename}:
  - row_count > 0
  - missing_count(status) = 1
  - duplicate_count(id) = 0
"""
# Run the Soda Core scan programmatically
scan = Scan()
scan.set_verbose(True) # Optional: for more detailed output
scan.add_configuration_yaml_str(data_source_config)
scan.add_sodacl_yaml_str(sodacl_checks)
scan.set_data_source_name("my_csv_source") # Must match the name in data_source_config
scan.execute()  # Run the scan
print("\n--- Scan Results ---")
if scan.has_check_fails():
    print("Scan completed with failures.")
else:
    print("Scan completed successfully.")
# Clean up the dummy CSV file
os.remove(csv_filename)