PyDeequ

1.5.0 · active · verified Wed Apr 08

PyDeequ is a Python API for Deequ, a library built on Apache Spark for defining 'unit tests for data' that measure data quality in large datasets. The current version is 1.5.0, released on April 1, 2025. New releases appear roughly every few months.

Warnings

Install

Imports

Quickstart

This script sets up a Spark session, creates a sample DataFrame, performs a completeness analysis on column 'a', and displays the results.

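import os

import pyspark
from pyspark.sql import SparkSession, Row

# PyDeequ reads SPARK_VERSION at import time to pick the matching Deequ jar,
# so set it before importing pydeequ. Assumption: the installed pyspark
# version matches the Spark version you run against.
os.environ.setdefault('SPARK_VERSION', pyspark.__version__)

import pydeequ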

# Set up Spark session
spark = (SparkSession
    .builder
    .config('spark.jars.packages', pydeequ.deequ_maven_coord)
    .config('spark.jars.excludes', pydeequ.f2j_maven_coord)
    .getOrCreate())

# Sample data
df = spark.sparkContext.parallelize([
            Row(a='foo', b=1, c=5),
            Row(a='bar', b=2, c=6),
            Row(a='baz', b=3, c=None)]).toDF()

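# Perform analysis: Completeness is the fraction of non-null values in a column
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness
analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Completeness('a')) \
                    .run()

# Show results as a DataFrame of metrics (a bare expression prints nothing in a script)
AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()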
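
To make the metric concrete: completeness is the fraction of rows in which a column is non-null. The sketch below reproduces that calculation in plain Python on the same sample data, without Spark. It is an illustration only, not PyDeequ's implementation; the helper name `completeness` and the choice to raise on empty input are assumptions made for this example.

```python
from typing import Any, Mapping, Sequence


def completeness(rows: Sequence[Mapping[str, Any]], column: str) -> float:
    """Return the fraction of rows whose `column` value is not None.

    Hypothetical helper for illustration; it is not part of PyDeequ.
    """
    if not rows:
        # Assumption for this sketch: an empty dataset has no defined ratio.
        raise ValueError("completeness is undefined for an empty dataset")
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


# Same sample data as the quickstart, expressed as plain dictionaries.
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]

print(completeness(rows, "a"))  # every value in 'a' is present
print(completeness(rows, "c"))  # two of the three values in 'c' are present
```

Column 'a' has no missing values, so its completeness is 1.0, while column 'c' has one missing value out of three and scores lower.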
