pyddq
5.0.0 | verified Fri May 01 | auth: no | python | deprecated
Python API for Drunken Data Quality (DDQ), a data quality validation library for Apache Spark DataFrames. The current version is 5.0.0, which supports Spark 2.2.1 and Python 3. The last release was in 2017, and the project is no longer actively maintained.
pip install pyddq
Common errors
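error ImportError: cannot import name 'Constraint' from 'pyddq'
cause The pyddq package was found, but it does not expose the requested name. This usually points to a broken install or to another module named pyddq shadowing it on the import path. (A missing install raises ModuleNotFoundError instead.)
fix Run 'pip install pyddq' in the same Python environment that runs your code, then retry the import.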
error Py4JJavaError: An error occurred while calling o123.run.
cause The Spark session is not initialized, or the Spark version is incompatible.
fix Create a SparkSession before running checks: spark = SparkSession.builder.appName('test').getOrCreate()
error AttributeError: module 'pyddq' has no attribute 'Runner'
cause Wrong import path; Runner is in the runner submodule.
fix Import as: from pyddq.runner import Runner
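The import errors above often come down to which copy of pyddq, if any, the interpreter is actually loading. As a minimal sketch using only the standard library (the helper name describe_package is hypothetical and not part of pyddq), the following reports whether a package is importable from the current interpreter and where it would be loaded from, without importing it:

```python
import importlib.util
import sys


def describe_package(name):
    """Return (found, location) for a top-level package without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Nothing with this name is visible on the current import path.
        return False, None
    # origin is the file the package would be loaded from, when one exists.
    return True, spec.origin


if __name__ == "__main__":
    found, location = describe_package("pyddq")
    print("interpreter:", sys.executable)
    print("pyddq found:", found)
    print("loaded from:", location)
```

If the reported interpreter is not the one pip installed into, or the location points at a local file rather than the installed package, that mismatch is the likely cause.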
Warnings
gotcha pyddq requires a running SparkSession. If Spark has not been initialized, imports may fail or hang.
fix Ensure SparkSession is created before using pyddq functions.
breaking The API uses the Spark 2.0+ SparkSession; older SparkContext-based code will not work.
fix Use SparkSession instead of SparkContext when creating DataFrames.
deprecated The library is no longer actively maintained. It has had no updates since 2017 and may not work with newer Spark versions.
fix Consider alternatives like great_expectations or Deequ for modern Spark setups.
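Because this page lists Spark 2.2.1 as the supported version, one low-cost guard is to compare the running Spark version against that release line before invoking any checks. The sketch below is an illustration under stated assumptions: the helper is_supported_spark and the decision to accept any 2.2.x release are hypothetical, not pyddq behavior. With a live session, the version string would come from spark.version.

```python
# Assumption: treat every 2.2.x release as compatible with pyddq 5.0.0.
SUPPORTED_SERIES = (2, 2)


def parse_major_minor(version):
    """Extract (major, minor) as integers from a string such as '2.2.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_supported_spark(version, supported=SUPPORTED_SERIES):
    """Return True when the version belongs to the supported release series."""
    return parse_major_minor(version) == supported
```

A caller could run is_supported_spark(spark.version) right after creating the session and stop early with a clear message, rather than hitting a Py4JJavaError later.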
Imports
- Constraint
  wrong: import pyddq; pyddq.Constraint
  correct: from pyddq import Constraint
- Check
  wrong: from pyddq import Check
  correct: from pyddq.runner import Check
- Runner
  wrong: import pyddq; pyddq.Runner
  correct: from pyddq.runner import Runner
Quickstart
from pyspark.sql import SparkSession

from pyddq import Constraint
from pyddq.runner import Runner, Check

# A SparkSession must exist before any pyddq call.
spark = SparkSession.builder.appName('ddq_example').getOrCreate()
df = spark.createDataFrame([('Alice', 34), ('Bob', 45), ('Charlie', 28)], ['name', 'age'])

# Define a constraint, attach it to the DataFrame in a Check, and run it.
constraint = Constraint(name='age_not_null', condition="age IS NOT NULL")
check = Check(df, [constraint])
runner = Runner()
results = runner.run(check)

# Print the outcome of each constraint.
for r in results:
    print(r.result, r.constraint_name)