Soda Core Redshift
Soda Core Redshift is a plugin for Soda Core that runs data quality checks against Amazon Redshift data warehouses. It provides the connector Soda Core uses to connect to Redshift, fetch metadata, and execute SQL queries for data quality monitoring. The current version is 3.5.6; releases generally track the main `soda-core` library, which is updated frequently.
Common errors
- ModuleNotFoundError: No module named 'psycopg2'
  cause The `psycopg2-binary` package, which provides the PostgreSQL adapter needed for Redshift, is either not installed or its dependencies are missing.
  fix Ensure `psycopg2-binary` is installed: `pip install psycopg2-binary`. If using a specific `psycopg2` version, ensure it is built with necessary system libraries for your OS (e.g., `libpq-dev` on Debian/Ubuntu, `postgresql-devel` on CentOS/RHEL).
- SodaException: Data source 'redshift' not found in configuration
  cause The `data_source` section in your `configuration.yml` (or configuration string) does not correctly define a data source named 'redshift', or the `type` is misspelled, or the name passed to `set_data_source_name()` in Python does not match.
  fix Verify that your `configuration.yml` contains a `data_source` entry with `type: redshift` and that the name matches the one used in `scan.set_data_source_name('redshift')`.
- FATAL: password authentication failed for user "your_username"
  cause Incorrect username or password provided for the Redshift connection.
  fix Double-check the `username` and `password` used in your configuration against your Redshift cluster's user credentials. Ensure there are no typos or leading/trailing spaces. Confirm the user exists and has permission to connect.
- Error: Could not connect to Redshift database: could not translate host name "your_redshift_host.com" to address: Name or service not known
  cause The provided Redshift host name cannot be resolved by DNS, or there's a network issue preventing access.
  fix Verify the `host` value is correct and fully qualified. Check your network configuration and DNS settings. Ensure there are no firewalls or security groups blocking outbound connections to Redshift from where Soda Core is running.
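Two of the errors above (the missing `psycopg2` module and the unresolvable host name) can be detected before a scan is started. The following is a minimal sketch of such a pre-flight check using only the Python standard library; `find_connection_problems` is a hypothetical helper name, not part of Soda Core, and the default port 5439 is taken from the Quickstart.

```python
import importlib.util
import socket


def find_connection_problems(host, port=5439, module_name="psycopg2"):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # Mirrors the ModuleNotFoundError case: is the PostgreSQL adapter importable?
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module '{module_name}' is not installed")

    # Mirrors the "could not translate host name" case: does DNS resolve the host?
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        problems.append(f"cannot resolve host '{host}': {exc}")

    return problems
```

Calling `find_connection_problems(os.environ["REDSHIFT_HOST"])` before `scan.execute()` would surface these two conditions with a clearer message. It does not test authentication or firewall rules, which only show up when a real connection is attempted.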
Warnings
- gotcha Ensure `soda-core` and `soda-core-redshift` versions are compatible. While minor version mismatches often work, major version mismatches can lead to unexpected behavior or errors.
- gotcha Redshift connection details (host, port, database, username, password) are sensitive. Hardcoding them in configuration files is a security risk; supply them through environment variables instead, as the Quickstart does.
- gotcha Connectivity issues to Redshift are often due to network firewalls, security groups, or incorrect database credentials/permissions.
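The version compatibility warning can be checked programmatically. The sketch below compares the major versions of the two installed packages using `importlib.metadata` from the standard library. The function names are hypothetical, and comparing only the major component is a simplification that mirrors the warning above rather than an official Soda compatibility rule.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading numeric component of a version such as '3.5.6'."""
    return int(version_string.split(".")[0])


def same_major(core_version, plugin_version):
    """True when both version strings share a major version."""
    return major_version(core_version) == major_version(plugin_version)


def installed_versions_compatible(core="soda-core", plugin="soda-core-redshift"):
    """Compare installed package versions; returns None if either is not installed."""
    try:
        return same_major(metadata.version(core), metadata.version(plugin))
    except metadata.PackageNotFoundError:
        return None
```

A result of `False` suggests reinstalling both packages at matching versions; `None` means at least one of them is not installed in the current environment.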
Install
- pip install soda-core-redshift
Imports
- Scan
from soda.scan import Scan
Quickstart
import os
from soda.scan import Scan
# Configure Redshift connection details using environment variables for security
redshift_host = os.environ.get('REDSHIFT_HOST', 'your_redshift_host.com')
redshift_port = os.environ.get('REDSHIFT_PORT', '5439')
redshift_database = os.environ.get('REDSHIFT_DATABASE', 'your_db_name')
redshift_username = os.environ.get('REDSHIFT_USERNAME', 'your_username')
redshift_password = os.environ.get('REDSHIFT_PASSWORD', 'your_password')
# Define the Soda Core configuration as a string
configuration_yaml = f'''
data_source redshift:
  type: redshift
  host: {redshift_host}
  port: {redshift_port}
  database: {redshift_database}
  username: {redshift_username}
  password: {redshift_password}
'''
# Define data quality checks as a string
checks_yaml = '''
checks for dim_users:
  - row_count > 0
  - duplicate_count(user_id) = 0
  - missing_count(email) = 0
'''
# Initialize and execute the Soda Scan
scan = Scan()
scan.set_data_source_name('redshift')
scan.add_configuration_yaml_str(configuration_yaml)
scan.add_checks_yaml_str(checks_yaml)
print("Running Soda Scan...")
scan.execute()
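# Result methods as named on the soda-core 3.x Scan class.
# Check error logs first so warnings cannot mask a scan that failed to run.
if scan.has_error_logs():
    print("Scan finished with errors.")
    raise SystemExit(1)
elif scan.has_check_fails():
    print("Scan finished with failures.")
    raise SystemExit(1)
elif scan.has_check_warns():
    print("Scan finished with warnings.")
else:
    print("Scan finished successfully.")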
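The Quickstart interpolates credentials into the configuration string from Python and falls back to placeholder defaults when a variable is unset. Soda Core configuration YAML can also reference environment variables directly with `${VAR}` placeholders, which keeps secrets out of Python strings. The following is a minimal sketch of that variant, assuming the same five `REDSHIFT_*` variable names; `missing_env_vars` is a hypothetical helper, not part of Soda Core, that fails fast instead of silently using defaults.

```python
import os

REQUIRED_VARS = [
    "REDSHIFT_HOST",
    "REDSHIFT_PORT",
    "REDSHIFT_DATABASE",
    "REDSHIFT_USERNAME",
    "REDSHIFT_PASSWORD",
]

# Plain (non-f) string: the ${...} placeholders are left for Soda Core to resolve.
configuration_yaml = """
data_source redshift:
  type: redshift
  host: ${REDSHIFT_HOST}
  port: ${REDSHIFT_PORT}
  database: ${REDSHIFT_DATABASE}
  username: ${REDSHIFT_USERNAME}
  password: ${REDSHIFT_PASSWORD}
"""


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]
```

Before passing `configuration_yaml` to `scan.add_configuration_yaml_str()`, call `missing_env_vars(REQUIRED_VARS)` and stop if the returned list is non-empty. This reports a missing variable by name rather than letting the scan fail later with an authentication or host resolution error.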