Soda Core DuckDB Connector
soda-core-duckdb is a Python connector that lets Soda Core, an open-source data quality and data contract verification engine, run data quality checks against DuckDB databases. Data quality expectations are defined in YAML (SodaCL) and executed as scans, either programmatically or from the CLI. The package is maintained as part of the broader Soda Core ecosystem, which receives frequent releases.
Warnings
- breaking: Soda Core v4 replaced the 'checks language' (used in `checks.yml`) with a 'Data Contracts-based syntax' (`contract.yml`). Users upgrading from Soda Core v3.x to v4.x need to migrate their data quality definitions.
- gotcha: When defining DuckDB connections in `configuration.yml` for Soda Core v3, some users have reported issues when using the `database` key to specify the DuckDB file path. Using `path` instead often resolves the problem.
- gotcha: Specific versions of `soda-core` (and by extension `soda-core-duckdb`) may pin a strict `duckdb` version range. For example, `soda-core` v3.5.0 relaxed its `duckdb` dependency to `<1.1.0`. Check the pinned range before upgrading `duckdb` independently.
- gotcha: An `ImportError: dlopen(.../site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace` can occur in Soda Core 3.x. It is caused by a transitive dependency of `opentelemetry`, which Soda Core uses to gather OSS usage statistics.
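The version-pinning gotcha above can be checked before running a scan. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `release_tuple`, `is_below`) are hypothetical and not part of Soda Core, and the `<1.1.0` bound is simply the example quoted above, not a statement about the current release.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version: str):
    """Parse up to three leading numeric components, e.g. '1.0.0' -> (1, 0, 0).

    Deliberately simplistic: pre-release suffixes such as 'rc1' are ignored,
    so this is not a full PEP 440 comparison.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad or truncate to exactly three components so '1.1' compares equal to '1.1.0'.
    return tuple((parts + [0, 0, 0])[:3])


def is_below(version: str, upper_bound: str) -> bool:
    """True if `version` is strictly lower than `upper_bound`."""
    return release_tuple(version) < release_tuple(upper_bound)


# Report what is installed in the current environment.
for pkg in ("soda-core", "soda-core-duckdb", "duckdb"):
    print(pkg, installed_version(pkg) or "not installed")

duckdb_version = installed_version("duckdb")
if duckdb_version is not None:
    # '1.1.0' is the example upper bound from the warning above.
    print("duckdb below 1.1.0:", is_below(duckdb_version, "1.1.0"))
```

For anything beyond a quick sanity check, the `packaging` library's version parsing is likely a more robust choice than the hand-rolled tuple comparison shown here.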
Install
pip install soda-core-duckdb
Imports
- Scan
from soda.scan import Scan
Quickstart
import os
import duckdb
from soda.scan import Scan
# 1. Create a dummy DuckDB database and a table
con = duckdb.connect(database=':memory:', read_only=False)
con.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR);")
con.execute("INSERT INTO my_table VALUES (1, 'Alice'), (2, 'Bob'), (3, NULL);")
# 2. Define a data source configuration (optional for in-memory, but good practice)
# This would typically be in a configuration.yml file
# ds_config_content = """
# data_source my_duckdb:
#   type: duckdb
#   connection:
#     database: ':memory:'
# """
# 3. Define SodaCL checks in a checks.yml file
checks_content = """
checks for my_table:
  - row_count > 0
  - missing_count(name) = 1
  - column_count = 2
"""
with open('checks.yml', 'w') as f:
    f.write(checks_content)
# 4. Programmatically run a Soda scan
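scan = Scan()
# Register the existing connection under the same logical name the scan will use
scan.add_duckdb_connection(duckdb_connection=con, data_source_name='my_duckdb_source')
scan.set_data_source_name('my_duckdb_source')  # Logical name for the data source
scan.add_sodacl_yaml_file('checks.yml')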
print('Running Soda scan...')
scan.execute()
if scan.has_check_fails():
    print('Scan failed!')
    # Optionally, you can assert or raise an error
    # scan.assert_no_checks_fail()
else:
    print('Scan successful: all checks passed or warned.')
print(scan.get_logs_text())
# Clean up temporary files
os.remove('checks.yml')
con.close()
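For a file-backed DuckDB database, the configuration mentioned in step 2 can be passed as a string instead of a `configuration.yml` file. The sketch below is an assumption-laden illustration for Soda Core v3: the helper names `duckdb_config_yaml` and `run_file_scan` are hypothetical, the flat `type` / `path` layout is assumed rather than confirmed by this page, and `path` is chosen over `database` only because of the gotcha noted under Warnings. The Soda import is deferred so the YAML helper can be used without Soda installed.

```python
def duckdb_config_yaml(data_source_name: str, db_path: str) -> str:
    """Build a configuration snippet pointing at a DuckDB file via `path`.

    Assumed layout for Soda Core v3; adjust if your version expects other keys.
    """
    return (
        f"data_source {data_source_name}:\n"
        f"  type: duckdb\n"
        f"  path: {db_path}\n"
    )


def run_file_scan(data_source_name: str, db_path: str, checks_yaml: str):
    """Run a scan against a DuckDB file and return the Scan object."""
    # Imported lazily: requires soda-core-duckdb to be installed.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source_name)
    scan.add_configuration_yaml_str(duckdb_config_yaml(data_source_name, db_path))
    scan.add_sodacl_yaml_str(checks_yaml)
    scan.execute()
    return scan


# Inspect the generated configuration; 'analytics.duckdb' is a placeholder path.
print(duckdb_config_yaml("my_duckdb", "analytics.duckdb"))
```

Passing the checks as a string also avoids writing and deleting a temporary `checks.yml`, which may be convenient in tests.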