Soda Core BigQuery

3.5.6 · active · verified Fri Apr 17

Soda Core BigQuery is an extension for Soda Core, an open-source data quality testing tool. It lets users define, execute, and monitor data quality checks directly against data stored in Google BigQuery. The package provides the connector and SQL dialect definitions that Soda Core needs to talk to BigQuery, so checks can be run from a Python environment, typically through the Soda CLI and YAML configuration files. The current version is 3.5.6, and it follows the release cadence of the broader Soda Core project.

Install

pip install soda-core-bigquery

Imports

from soda.scan import Scan

Quickstart

This quickstart demonstrates how to run a Soda Scan programmatically against a BigQuery data source. It configures the BigQuery connection and defines simple data quality checks. For this to run successfully against your data, ensure BigQuery authentication is set up (e.g., `GOOGLE_APPLICATION_CREDENTIALS` environment variable or `gcloud auth application-default login`) and replace placeholder values for `project_id`, `dataset`, and `table` with your actual BigQuery resources.

import os
from soda.scan import Scan

# Authentication for local execution: either
#   1. point 'GOOGLE_APPLICATION_CREDENTIALS' at a service account key JSON file, or
#   2. run 'gcloud auth application-default login'.
# Optionally set the project through the environment:
# os.environ['BIGQUERY_PROJECT_ID'] = 'your-gcp-project-id'
project_id = os.environ.get('BIGQUERY_PROJECT_ID', 'my-dummy-project')

# Initialize a Soda Scan and name the data source it should run against
scan = Scan()
scan.set_data_source_name('bigquery_source')

# Add the BigQuery data source configuration via a YAML string.
# Replace 'my-dummy-project' and 'my_dataset' with your actual project ID and dataset.
# 'use_context_auth: true' tells Soda to use Application Default Credentials.
scan.add_configuration_yaml_str(f'''
data_source bigquery_source:
  type: bigquery
  use_context_auth: true
  project_id: {project_id}
  dataset: my_dataset
''')

# Define data quality checks via a YAML string.
# Replace 'my_table' with an actual table in the configured dataset.
scan.add_checks_yaml_str('''
checks for my_table:
  - row_count > 0: # Checks if the table has any rows
      name: 'Table should not be empty'
  - missing_count(id) = 0: # Checks for missing values in 'id' column
      name: 'No missing IDs'
  - duplicate_count(id) = 0: # Checks for duplicate values in 'id' column
      name: 'No duplicate IDs'
''')

print("Running Soda Scan...")
scan.execute()

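if scan.has_error_logs():
    # Connection and configuration problems are logged as errors, not as check failures
    print("\nSoda Scan finished with errors:")
    print(scan.get_error_logs_text())
elif scan.has_check_fails():
    print("\nSoda Scan finished with failed checks.")
else:
    print("\nSoda Scan finished with no failures.")

# You can inspect the scan results for detailed outcomes
# print(scan.get_scan_results())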
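
To go beyond a single pass/fail flag, the dictionary returned by `scan.get_scan_results()` can be summarized per check. The sketch below is illustrative only: it assumes the results contain a `checks` list whose entries carry `name` and `outcome` keys, which should be verified against the output of your installed version. The helpers `summarize_outcomes` and `failed_check_names` are hypothetical names, not part of Soda Core, and the sample data is hand-written so the snippet runs without a BigQuery connection.

```python
from collections import Counter


def summarize_outcomes(scan_results: dict) -> Counter:
    """Count checks per outcome in a scan-results dictionary.

    Assumes each entry under "checks" carries an "outcome" key
    (for example "pass", "warn", or "fail"); entries without one
    are counted as "unknown".
    """
    checks = scan_results.get("checks", [])
    return Counter(check.get("outcome") or "unknown" for check in checks)


def failed_check_names(scan_results: dict) -> list:
    """Return the names of checks whose outcome is "fail"."""
    return [
        check.get("name", "<unnamed>")
        for check in scan_results.get("checks", [])
        if check.get("outcome") == "fail"
    ]


# Hand-written sample mirroring the assumed shape;
# in practice, pass scan.get_scan_results() instead.
sample_results = {
    "checks": [
        {"name": "Table should not be empty", "outcome": "pass"},
        {"name": "No missing IDs", "outcome": "fail"},
        {"name": "No duplicate IDs", "outcome": "pass"},
    ]
}

print(dict(summarize_outcomes(sample_results)))
print(failed_check_names(sample_results))
```

Keeping the summary logic separate from the scan itself means it can be exercised against saved results without credentials; the `.get(...)` lookups are a defensive choice in case the real dictionary omits a key.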
