Databricks Data Quality eXtended (DQX)
Data Quality eXtended (DQX) is a Python library for defining, executing, and monitoring data quality checks. It leverages Apache Spark and is designed to integrate seamlessly within the Databricks ecosystem, supporting features like Delta Lake, DLT, and Unity Catalog. The library is actively maintained with frequent releases, currently at version 0.13.0, introducing features like an enhanced data quality dashboard and AI-assisted rule generation.
Common errors
- `ModuleNotFoundError: No module named 'pyspark'`
  - cause: This error commonly occurs when installing or running databricks-labs-dqx via the Databricks Labs CLI in an environment where PySpark is not installed or the Python version is incompatible. DQX requires Python 3.10 or later and Databricks CLI v0.241 or later for CLI-based installation.
  - fix: Ensure your local environment has Python 3.10+ and Databricks CLI v0.241+. Inside a Databricks notebook or job, PySpark is already available. If you install DQX as a dependency of a local Spark application, install PySpark explicitly with `pip install pyspark`.
- `ImportError: cannot import name 'DQXInstaller'`
  - cause: This import error arises when attempting to import `DQXInstaller` directly, typically while setting up DQX dashboards. The class or its module path may have changed, or it may not be intended for direct programmatic import in the installed version (e.g., v0.12.0 or later).
  - fix: To set up DQX dashboards, use the Databricks Labs CLI instead: `databricks labs install dqx`. This deploys all necessary DQX components, including the dashboard, into your workspace without requiring direct imports of installer classes.
- `databricks.labs.dqx.commons.exceptions.DataQualityException: Data quality checks failed`
  - cause: The DQX engine raises this exception when one or more data quality rules with `criticality` set to 'error' fail, indicating an issue severe enough to halt further processing.
  - fix: Examine the `_errors` and `_warnings` columns in the DataFrame returned by DQX to identify the failing checks and affected rows. Then either cleanse the problematic data, refine the rules, or downgrade the `criticality` of acceptable checks to 'warn'. Add logic to quarantine or otherwise handle invalid rows based on the DQX output.
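The inspect-and-quarantine fix can be sketched in miniature with plain Python structures. The `_errors` payload shape below is illustrative, not the library's literal format; with real DataFrames you would filter on the `_errors` column (e.g. `df.filter(F.col("_errors").isNull())`) or use the engine's split-style helpers where available.

```python
# Simulated rows from a DQX-annotated DataFrame: apply_checks appends
# _errors / _warnings columns, null for rows that passed every check.
# The payload shape here is an illustrative stand-in.
rows = [
    {"id": "A", "value": 1, "_errors": None},
    {"id": "B", "value": 2, "_errors": None},
    {"id": "C", "value": None,
     "_errors": {"value_not_null": "value column should not be null"}},
]

# Quarantine pattern: route failing rows aside, keep clean rows flowing.
valid = [r for r in rows if r["_errors"] is None]
quarantined = [r for r in rows if r["_errors"] is not None]

print(len(valid), len(quarantined))  # → 2 1
```

The same two-way split is what lets a pipeline continue with `valid` while `quarantined` is written to a side table for review.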
- `ModuleNotFoundError: No module named 'databricks.sdk'`
  - cause: `databricks-labs-dqx` relies on the `databricks-sdk` package to interact with Databricks workspace resources; this error means the SDK is missing from the active Python environment.
  - fix: Install it with `%pip install databricks-sdk` in a Databricks notebook, or `pip install databricks-sdk` locally. Ensure the installed SDK version is compatible with your `databricks-labs-dqx` version.
Warnings
- gotcha DQX requires an active SparkSession. When running outside a Databricks environment (e.g., locally), you must explicitly install `pyspark` and create a `SparkSession` instance before initializing `DQEngine`.
- breaking Starting from v0.7.1, the `apply_checks` method enforces strict type validation for rules. Rules must be passed as a list of `DQRule` objects. Passing dictionaries or other types directly will raise a `TypeError`.
- breaking The Data Quality Dashboard has been significantly enhanced and restructured in v0.13.0. Existing custom dashboard integrations or deployment scripts might require updates to align with the new three-tab structure and underlying APIs.
- gotcha The `DQGenerator` class for AI-assisted rule generation (v0.12.0) and ODCS Data Contract rule generation (v0.11.0) introduce new APIs; users who want generated rules should adopt these classes and methods instead of creating rules manually.
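Besides `DQRule` objects, DQX also accepts rules defined declaratively as metadata (typically loaded from YAML) via `DQEngine.apply_checks_by_metadata`. A minimal sketch of that format follows; the field names (`criticality`, `check`, `function`, `arguments`) follow recent DQX documentation and may differ across versions (older releases used e.g. `col_name` instead of `column`):

```python
# Checks expressed as metadata: a list of dicts, typically parsed from a
# YAML checks file. Field names are per recent DQX docs (assumption:
# verify against the docs for your installed version).
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "value"}},
    },
    {
        "criticality": "warn",
        "check": {
            "function": "sql_expression",
            "arguments": {"expression": "event_date >= '2023-01-01'"},
        },
    },
]

# These dicts would be passed to dq_engine.apply_checks_by_metadata(df, checks).
print([c["check"]["function"] for c in checks])  # → ['is_not_null', 'sql_expression']
```

Defining checks as metadata keeps rules in version-controlled config rather than code, at the cost of the strict type validation that `DQRule` objects get.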
Install
- `pip install databricks-labs-dqx pyspark` (include `pyspark` only for local, non-Databricks environments)
Imports
- DQEngine
from databricks.labs.dqx.engine import DQEngine
- DQRule
from databricks.labs.dqx.rule import DQRule
- DQGenerator
from databricks.labs.dqx.profiler.generator import DQGenerator
Quickstart
from pyspark.sql import SparkSession
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule
from databricks.labs.dqx import check_funcs

# API names below reflect recent DQX releases; check the DQX docs
# for the exact names in your installed version.

# Initialize SparkSession (for local execution; on Databricks one is provided)
spark = (SparkSession.builder
         .appName("DQXQuickstart")
         .master("local[*]")
         .getOrCreate())

# Create a sample DataFrame
data = [("A", 1, "2023-01-01"), ("B", 2, "2023-01-02"),
        ("C", None, "2023-01-03"), ("D", 4, "2023-01-01")]
df = spark.createDataFrame(data, ["id", "value", "event_date"])

# Define data quality rules using the built-in check functions
checks = [
    DQRowRule(name="value_not_null", criticality="error",
              check_func=check_funcs.is_not_null, column="value"),
    DQDatasetRule(name="id_is_unique", criticality="error",
                  check_func=check_funcs.is_unique, columns=["id"]),
    DQRowRule(name="event_date_freshness", criticality="warn",
              check_func=check_funcs.sql_expression,
              check_func_kwargs={"expression": "event_date >= '2023-01-01'",
                                 "msg": "event_date should be recent"}),
]

# Initialize DQEngine (a WorkspaceClient is required, even for local Spark)
dq_engine = DQEngine(WorkspaceClient())

# Apply checks; the result carries _errors and _warnings columns
results = dq_engine.apply_checks(df, checks)
print("Data Quality Check Results:")
results.show(truncate=False)

# Stop SparkSession
spark.stop()