Databricks Data Quality eXtended (DQX)

0.13.0 · active · verified Thu Apr 09

Data Quality eXtended (DQX) is a Python library for defining, executing, and monitoring data quality checks. It is built on Apache Spark and designed to integrate seamlessly with the Databricks ecosystem, including Delta Lake, Delta Live Tables (DLT), and Unity Catalog. The library is actively maintained with frequent releases; the current version, 0.13.0, introduces features such as an enhanced data quality dashboard and AI-assisted rule generation.

Install
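
DQX is distributed by Databricks Labs on PyPI; assuming the standard package name `databricks-labs-dqx`, installation is:

```shell
# Install DQX from PyPI (Databricks Labs distribution)
pip install databricks-labs-dqx
```

In a Databricks notebook, use `%pip install databricks-labs-dqx` instead, then restart the Python process so the library is picked up.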

Quickstart

This quickstart demonstrates how to set up a local SparkSession, create a sample DataFrame, define data quality rules, and apply them with the `DQEngine`. The results are displayed and the SparkSession is stopped.

from pyspark.sql import SparkSession

# Module and class names below follow the databricks-labs-dqx package docs;
# the API has evolved across releases, so verify them against your installed version.
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule
from databricks.labs.dqx import check_funcs
from databricks.sdk import WorkspaceClient

# Initialize SparkSession (for local execution)
spark = SparkSession.builder.appName("DQXQuickstart") \
    .master("local[*]") \
    .getOrCreate()

# Create a sample DataFrame
data = [("A", 1, "2023-01-01"), ("B", 2, "2023-01-02"), ("C", None, "2023-01-03"), ("D", 4, "2023-01-01")]
columns = ["id", "value", "event_date"]
df = spark.createDataFrame(data, columns)

# Define data quality rules
checks = [
    # Row-level check: flag rows where "value" is null
    DQRowRule(name="value_not_null", criticality="error",
              check_func=check_funcs.is_not_null, column="value"),
    # Dataset-level check: "id" values must be unique across the DataFrame
    DQDatasetRule(name="id_is_unique", criticality="error",
                  check_func=check_funcs.is_unique, columns=["id"]),
    # Row-level check expressed as an arbitrary SQL expression
    DQRowRule(name="event_date_freshness", criticality="warn",
              check_func=check_funcs.sql_expression,
              check_func_kwargs={"expression": "event_date >= '2023-01-01'",
                                 "msg": "event_date should be recent"}),
]

# Initialize DQEngine (the workspace client backs workspace features such as
# checks storage and the quality dashboard)
dq_engine = DQEngine(WorkspaceClient())

# Apply checks: returns the input DataFrame annotated with _errors/_warnings
# result columns; apply_checks_and_split instead returns valid and quarantined
# rows as two separate DataFrames
checked_df = dq_engine.apply_checks(df, checks)

# Print results (show() rather than display(), which is notebook-only)
print("Data Quality Check Results:")
checked_df.show(truncate=False)

# Stop SparkSession
spark.stop()
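
Beyond class-based rules, DQX also supports defining checks declaratively as metadata (the same shape it reads from YAML files) and applying them with the engine's metadata-driven method. A minimal sketch, assuming the `apply_checks_by_metadata` method and the check-function names (`is_not_null`, `sql_expression`) from the DQX documentation:

```python
# Checks expressed as plain metadata rather than rule objects.
# Function and argument names assume the databricks-labs-dqx check registry;
# verify against your installed version.
checks_metadata = [
    {
        "name": "value_not_null",
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "value"}},
    },
    {
        "name": "event_date_freshness",
        "criticality": "warn",
        "check": {
            "function": "sql_expression",
            "arguments": {"expression": "event_date >= '2023-01-01'"},
        },
    },
]

# Applied with an existing DQEngine and DataFrame (see the quickstart above):
# checked_df = dq_engine.apply_checks_by_metadata(df, checks_metadata)
```

Keeping checks as metadata lets teams store them in version-controlled YAML files and share one definition between jobs, rather than repeating rule objects in every pipeline.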
