Soda Core BigQuery
Soda Core BigQuery is an extension for Soda Core, an open-source data quality testing tool. It enables users to define, execute, and monitor data quality checks directly against data stored in Google BigQuery. This package provides the necessary connector and SQL dialect definitions to interact with BigQuery, allowing for comprehensive data quality assessments within a Python environment, typically managed through the Soda CLI and YAML configuration files. The current version is 3.5.6, and it follows the release cadence of the broader Soda Core project.
Common errors
-
ModuleNotFoundError: No module named 'soda.scan'
cause The base `soda-core` package, which provides the core `Scan` object and CLI, is either not installed or is an incompatible version.
fix Ensure `soda-core` is installed in your environment. Running `pip install soda-core-bigquery` should install `soda-core` as a dependency; if it does not, run `pip install soda-core` directly.
-
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials.
cause Soda Core cannot find valid Google Cloud authentication credentials in the execution environment or in the specified configuration.
fix Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key JSON file, or authenticate your user by running `gcloud auth application-default login`.
-
google.api_core.exceptions.NotFound: 404 Not found: Dataset <your-project>:<your-dataset>
cause The BigQuery project, dataset, or table named in your `configuration.yml` or `checks.yml` does not exist, or the authenticated user or service account lacks permission to access it.
fix Double-check the project ID, dataset ID, and table name for typos. Verify that the credentials have the `BigQuery Data Viewer` role on the data being checked and the `BigQuery Job User` role on the project that runs the queries.
-
ERROR: Data source 'your_data_source_name' in configuration is not valid.
cause There is a syntax error, a missing required field, or an incorrect type in the data source block of `configuration.yml` or of the programmatic configuration string.
fix Review your `configuration.yml` (or the content passed to `add_configuration_yaml_str`) carefully. Ensure `type: bigquery` is specified and that `project_id` (if required) is present and accurate, and pay attention to YAML indentation.
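The credentials error above can often be narrowed down before a scan is run. The following is a minimal sketch that uses only the Python standard library; the helper name `describe_gcp_credentials` is hypothetical and not part of Soda Core. It only reports whether `GOOGLE_APPLICATION_CREDENTIALS` is set and points at an existing file, and does not validate the key itself.

```python
import os
from pathlib import Path


def describe_gcp_credentials(env=None):
    """Describe how Google credentials would likely be discovered.

    Hypothetical helper: it inspects the environment only and does not
    contact Google Cloud or check that the key is valid.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        # Without the variable, only a prior
        # 'gcloud auth application-default login' can supply credentials.
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not Path(key_path).is_file():
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {key_path}"
    return f"Service account key found at {key_path}"


if __name__ == "__main__":
    print(describe_gcp_credentials())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in a test without touching the real environment.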
Warnings
- breaking Soda Core 3.x introduced significant changes to the CLI commands (e.g., `soda scan` replaced `soda analyze`) and the structure of configuration YAML files compared to 2.x versions. Migrating from older versions requires updating commands and YAML definitions.
- gotcha Proper BigQuery authentication and IAM roles are critical. Common issues include incorrect service account keys, missing `GOOGLE_APPLICATION_CREDENTIALS` environment variable, or insufficient IAM permissions for the service account/user.
- gotcha `soda-core-bigquery` is an extension package. While `pip install soda-core-bigquery` typically pulls `soda-core` as a dependency, the core functionalities (CLI, `Scan` object) are provided by `soda-core`. Ensure `soda-core` is available and compatible.
- gotcha YAML configuration files (`configuration.yml`, `checks.yml`) are highly sensitive to indentation and syntax. Minor errors can lead to failures, unexecuted checks, or incorrect interpretations without clear error messages.
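Because indentation slips in YAML are easy to miss, a rough pre-check can help. The sketch below is a heuristic, not a YAML parser: the function `find_indentation_problems` is hypothetical, and the two-space convention it assumes is a common style choice rather than a YAML or Soda requirement. It flags tabs in leading whitespace and indents that are not a multiple of the chosen width.

```python
def find_indentation_problems(yaml_text, indent_width=2):
    """Return (line_number, message) pairs for common indentation slips.

    Heuristic only: valid YAML may use other indent widths, and this
    function does not check the structure of the document.
    """
    problems = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.lstrip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and full-line comments
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent_width != 0:
            problems.append(
                (number, f"indent of {len(leading)} spaces is not a multiple of {indent_width}")
            )
    return problems


if __name__ == "__main__":
    sample = "checks for my_table:\n   - row_count > 0\n"
    for line_number, message in find_indentation_problems(sample):
        print(f"line {line_number}: {message}")
```

A clean result from this check does not mean the file is valid; it only rules out the two slips it looks for.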
Install
-
pip install soda-core-bigquery
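After installing, it can be useful to confirm that both the extension and the base package are present. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution names are taken from the pip commands on this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Both distributions should be reported with a version number.
    for name in ("soda-core", "soda-core-bigquery"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```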
Imports
- Scan
from soda.scan import Scan
Quickstart
import os
from soda.scan import Scan

# Configure your GCP project ID. For local execution, ensure that either:
# 1. the 'GOOGLE_APPLICATION_CREDENTIALS' env var points to a service account key JSON, or
# 2. 'gcloud auth application-default login' has been run.
# os.environ['BIGQUERY_PROJECT_ID'] = 'your-gcp-project-id'
project_id = os.environ.get('BIGQUERY_PROJECT_ID', 'my-dummy-project')

# Initialize a Soda Scan and select the data source it should run against
scan = Scan()
scan.set_data_source_name('bigquery_source')

# Add BigQuery data source configuration via a YAML string
# Replace 'my-dummy-project' and 'my_dataset' with your actual GCP project ID and dataset
scan.add_configuration_yaml_str(f'''
data_source bigquery_source:
  type: bigquery
  use_context_auth: true  # use Application Default Credentials
  project_id: {project_id}
  dataset: my_dataset
''')

# Define data quality checks via a YAML string
# Replace 'my_table' with an actual table in the configured dataset for real checks
scan.add_sodacl_yaml_str('''
checks for my_table:
  - row_count > 0:  # Checks if the table has any rows
      name: 'Table should not be empty'
  - missing_count(id) = 0:  # Checks for missing values in 'id' column
      name: 'No missing IDs'
  - duplicate_count(id) = 0:  # Checks for duplicate values in 'id' column
      name: 'No duplicate IDs'
''')

print("Running Soda Scan...")
scan.execute()

if scan.has_check_fails():
    print("\nSoda Scan finished with failures.")
else:
    print("\nSoda Scan finished with no failures.")

# You can inspect the scan results for detailed outcomes
# print(scan.get_scan_results())
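The dictionary returned by `get_scan_results()` can be reduced to a short summary for logs or CI output. The sketch below assumes that the dictionary contains a `checks` list whose entries carry an `outcome` key (for example `pass` or `fail`); that shape is an assumption to verify against your installed version, and `summarize_outcomes` is a hypothetical helper, not a Soda Core function.

```python
from collections import Counter


def summarize_outcomes(scan_results):
    """Count check outcomes in a scan results dictionary.

    Assumes a "checks" list of dictionaries with an "outcome" key;
    entries without an outcome are counted as "unknown".
    """
    checks = scan_results.get("checks", [])
    return dict(Counter(check.get("outcome") or "unknown" for check in checks))


if __name__ == "__main__":
    # Illustrative data only; real results come from scan.get_scan_results().
    example = {
        "checks": [
            {"name": "Table should not be empty", "outcome": "pass"},
            {"name": "No missing IDs", "outcome": "fail"},
        ]
    }
    print(summarize_outcomes(example))
```

Keeping the helper independent of the `Scan` object means it can be tested with hand-written dictionaries and reused on results loaded from a file.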