PyDeequ
PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets. The current version is 1.5.0, released on April 1, 2025. New releases appear roughly every few months.
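To make the "unit tests for data" idea concrete, here is a minimal, library-free sketch in plain Python. It is not PyDeequ's API: the `completeness` helper and the `rows` list are hypothetical names introduced only for illustration, and the rows mirror the sample data used in the Quickstart. The sketch computes one metric (the fraction of non-null values in a column) and asserts a constraint on it.

```python
from typing import Any, Dict, List


def completeness(rows: List[Dict[str, Any]], column: str) -> float:
    """Return the fraction of rows whose value in `column` is not None."""
    if not rows:
        raise ValueError("completeness is undefined for an empty dataset")
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


# Hypothetical in-memory stand-in for a Spark DataFrame
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]

# A "unit test for data": compute a metric, then assert a constraint on it
assert completeness(rows, "a") == 1.0

# Column 'c' has one missing value out of three
print(f"completeness(c) = {completeness(rows, 'c'):.3f}")
```

PyDeequ applies the same pattern at scale: metrics are computed by Spark over the whole dataset, and constraints are evaluated against them.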
Warnings
- breaking: PyDeequ 2.0 introduces Spark Connect support and moves away from the Py4J-based JVM bridge. This change may require code modifications for compatibility.
- deprecated: The 'hasPattern' function is deprecated in PyDeequ 1.5.0 and will be removed in a future release.
Install
pip install pydeequ
Imports
- AnalysisRunner
from pydeequ.analyzers import AnalysisRunner
Quickstart
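import os
# Assumption: recent PyDeequ releases read SPARK_VERSION at import time,
# so set it to match your Spark installation before importing pydeequ
os.environ.setdefault('SPARK_VERSION', '3.3')
from pyspark.sql import SparkSession, Row
import pydeequ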
# Set up Spark session
spark = (SparkSession
    .builder
    .config('spark.jars.packages', pydeequ.deequ_maven_coord)
    .config('spark.jars.excludes', pydeequ.f2j_maven_coord)
    .getOrCreate())
# Sample data
df = spark.sparkContext.parallelize([
    Row(a='foo', b=1, c=5),
    Row(a='bar', b=2, c=6),
    Row(a='baz', b=3, c=None)]).toDF()
# Perform analysis
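from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness
# Completeness('a') measures the fraction of non-null values in column 'a'
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Completeness('a')) \
    .run()
# Show results as a Spark DataFrame
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()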
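Beyond computing metrics, PyDeequ can also assert constraints on them. The following is a minimal sketch, assuming the Py4J-based 1.x API and the `spark` session and `df` from the Quickstart. The helper name `run_basic_checks`, the check description, and the chosen constraints are illustrative, not part of the library.

```python
def run_basic_checks(spark, df):
    """Run a few constraints on `df` and return the results as a Spark DataFrame."""
    # Imported lazily so that defining this helper does not require
    # PyDeequ to be installed and configured yet
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    # CheckLevel.Warning reports failed constraints as warnings rather than errors
    check = Check(spark, CheckLevel.Warning, "Quickstart checks")

    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check
                        .hasSize(lambda n: n >= 3)   # at least 3 rows
                        .isComplete("a")             # no nulls in column 'a'
                        .isUnique("a")               # no duplicates in column 'a'
                        .isNonNegative("b"))         # all values in 'b' are >= 0
              .run())

    # One row per constraint, including its status and any failure message
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

Calling `run_basic_checks(spark, df).show()` prints the outcome of each constraint. Swapping `isComplete("a")` for `isComplete("c")` should produce a failed constraint, since column `c` contains a null in the sample data.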