Soda Core DuckDB Connector

3.5.6 · active · verified Mon Apr 13

soda-core-duckdb is a Python connector that lets Soda Core, an open-source data quality and data contract verification engine, connect to DuckDB databases and run data quality checks against them. Expectations are defined in YAML (SodaCL), and scans are executed programmatically or via the CLI. The library is actively maintained as part of the broader Soda Core ecosystem, which sees frequent updates and new feature releases.

Warnings

Install

Imports

Quickstart

This quickstart sets up an in-memory DuckDB database, defines data quality checks in SodaCL in a `checks.yml` file, and runs a programmatic scan with the `soda.scan.Scan` class. The checks verify that the table has at least one row, that the `name` column has exactly one missing value, and that the table has exactly two columns.
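Before the full script, it may help to see what the three checks assert about the sample data. The following is a plain-Python sketch of the check semantics only, not how Soda Core evaluates them (Soda compiles checks to SQL against the data source). The helper name `evaluate_checks` is hypothetical, and the sketch assumes that a missing value means NULL, which is SodaCL's default when no custom missing values are configured.

```python
# Sample rows mirroring the quickstart table: (id, name)
rows = [(1, "Alice"), (2, "Bob"), (3, None)]
columns = ["id", "name"]


def evaluate_checks(rows, columns):
    """Hypothetical helper: mimic the three quickstart checks in plain Python."""
    name_index = columns.index("name")
    row_count = len(rows)
    # Assumption: "missing" means NULL (None in Python)
    missing_name = sum(1 for row in rows if row[name_index] is None)
    return {
        "row_count > 0": row_count > 0,
        "missing_count(name) = 1": missing_name == 1,
        "column_count = 2": len(columns) == 2,
    }


results = evaluate_checks(rows, columns)
for check, passed in results.items():
    print(f"{check}: {'pass' if passed else 'fail'}")
```

With the three rows inserted by the quickstart, all three conditions hold, so the scan is expected to report no failed checks.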

import os
import duckdb
from soda.scan import Scan

# 1. Create a dummy DuckDB database and a table
con = duckdb.connect(database=':memory:', read_only=False)
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR);")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob'), (3, NULL);")

# 2. Data source configuration (not needed in this script, which passes the
# existing connection straight to the scan; shown for reference only).
# This would typically be in a configuration.yml file
# ds_config_content = """
# data_source my_duckdb:
#   type: duckdb
#   connection:
#     database: ':memory:'
# """

# 3. Define SodaCL checks in a checks.yml file
checks_content = """
checks for my_table:
  - row_count > 0
  - missing_count(name) = 1
  - column_count = 2
"""

with open('checks.yml', 'w') as f:
    f.write(checks_content)

# 4. Programmatically run a Soda scan
scan = Scan()
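# Register the connection under the same name the scan is told to use;
# otherwise add_duckdb_connection registers it under its default name.
scan.add_duckdb_connection(con, data_source_name='my_duckdb_source')
scan.set_data_source_name('my_duckdb_source') # Logical name for the data source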
scan.add_sodacl_yaml_file('checks.yml') # Load the single checks file written above

print('Running Soda scan...')
scan.execute()

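# Treat scan errors and failed checks as failure; warnings alone do not fail the scan
if scan.has_error_logs() or scan.has_check_fails():
    print('Scan failed!')
    # Optionally, you can assert or raise an error
    # scan.assert_no_checks_fail()
else:
    print('Scan successful: all checks passed or warned.')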

print(scan.get_logs_text())

# Clean up temporary files
os.remove('checks.yml')
con.close()
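The quickstart writes `checks.yml` into the working directory and removes it at the end, so the file is left behind if the scan raises before the cleanup step. One possible alternative, sketched here with the standard library only, is a small context manager that writes the SodaCL text to a temporary file and always deletes it. The name `temporary_checks_file` is hypothetical and not part of Soda Core.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_checks_file(sodacl_text):
    """Hypothetical helper: write SodaCL text to a temp .yml file, yield its path."""
    fd, path = tempfile.mkstemp(suffix=".yml", text=True)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(sodacl_text)
        yield path
    finally:
        # Runs even if the body of the with-block raises
        os.remove(path)


checks_content = """
checks for my_table:
  - row_count > 0
  - missing_count(name) = 1
  - column_count = 2
"""

with temporary_checks_file(checks_content) as checks_path:
    # In the quickstart, the scan would be given checks_path here
    # in place of the hard-coded 'checks.yml'.
    with open(checks_path) as f:
        written = f.read()

print("temp file removed:", not os.path.exists(checks_path))
```

The same effect can be had with a `try`/`finally` around the scan. The context manager just keeps the cleanup next to the file creation.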
