pyddq

Version 5.0.0 · verified Fri May 01 · no auth · Python · deprecated

Python API for Drunken Data Quality (DDQ), a data-quality validation library for Apache Spark DataFrames. The current version is 5.0.0, supporting Spark 2.2.1 and Python 3. The last release was in 2017; the project is stable but no longer receives updates.

pip install pyddq
error ImportError: cannot import name 'Constraint' from 'pyddq'
cause pyddq is missing from the active Python environment, or the installed version does not export Constraint at the package top level.
fix
Run 'pip install pyddq' and ensure it's in the correct Python environment.
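An environment mismatch between pip and the interpreter running the job is the usual culprit. A quick way to check, assuming `python` is the interpreter your Spark job uses:

```shell
# Install into the exact interpreter that runs the job,
# so pip and python agree on the environment.
python -m pip install pyddq

# Confirm the package resolves from that same interpreter.
python -c "import pyddq; print(pyddq.__file__)"
```

If the second command fails while `pip show pyddq` succeeds, pip is bound to a different interpreter than the one running your job.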
error Py4JJavaError: An error occurred while calling o123.run.
cause Spark session not initialized or incompatible Spark version.
fix
Create a SparkSession before running checks: spark = SparkSession.builder.appName('test').getOrCreate()
error AttributeError: module 'pyddq' has no attribute 'Runner'
cause Wrong import path; Runner is in the runner submodule.
fix
Import as: from pyddq.runner import Runner
gotcha pyddq requires an active SparkSession. Without one, any call that touches a DataFrame will fail.
fix Ensure SparkSession is created before using pyddq functions.
breaking The API uses Spark 2.0+ SparkSession; older SparkContext-based code will not work.
fix Use SparkSession instead of SparkContext when creating DataFrames.
deprecated The library is no longer actively maintained. No updates since 2017; may not work with newer Spark versions.
fix Consider alternatives like great_expectations or Deequ for modern Spark setups.
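If migrating off pyddq, the core of a not-null constraint is simple to express directly. A minimal sketch in plain Python (the `check_not_null` helper is hypothetical, for illustration; in a real Spark job the same logic would be a DataFrame filter such as `df.filter(df[column].isNull()).count()`):

```python
def check_not_null(rows, column):
    """Return (passed, violation_count) for a not-null constraint.

    `rows` is a list of dicts standing in for DataFrame rows.
    """
    violations = sum(1 for row in rows if row.get(column) is None)
    return violations == 0, violations

rows = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": None}]
passed, n = check_not_null(rows, "age")
# passed is False, n is 1: Bob's age is null
```

This mirrors the constraint concept without any pyddq dependency; libraries like great_expectations or Deequ package the same idea with reporting and Spark integration.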

Quickstart: create a Spark DataFrame, define a constraint, run a check with Runner.

from pyspark.sql import SparkSession

from pyddq import Constraint
from pyddq.runner import Check, Runner

# A SparkSession must exist before any pyddq call (see gotcha above).
spark = SparkSession.builder.appName('ddq_example').getOrCreate()
df = spark.createDataFrame([('Alice', 34), ('Bob', 45), ('Charlie', 28)], ['name', 'age'])

# A constraint is a SQL boolean expression over the DataFrame's columns.
constraint = Constraint(name='age_not_null', condition='age IS NOT NULL')

# Bundle the DataFrame and its constraints into a Check, then execute it.
check = Check(df, [constraint])
runner = Runner()
results = runner.run(check)

for r in results:
    print(r.result, r.constraint_name)

spark.stop()