pyddq

5.0.0 verified Fri May 01 auth: no python deprecated

Python API for Drunken Data Quality (DDQ), a data quality validation library for Apache Spark DataFrames. The current version is 5.0.0, which supports Spark 2.2.1 and Python 3. The last release was in 2017, and the project is no longer actively maintained.

pip install pyddq

Common errors

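error ImportError: cannot import name 'Constraint' from 'pyddq'

cause Python found a module named pyddq, but it does not expose the requested name. This usually means pyddq is installed incorrectly or in a different environment, or that a local file or directory named pyddq is shadowing the installed package.

fix

Run 'pip install pyddq' in the same Python environment that runs your code, and check that no local pyddq.py shadows the installed package.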
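
To tell a missing package apart from one that is present but not the one you expect, a small diagnostic along these lines can help. It is a minimal sketch using only the standard library (Python 3.8+ for `importlib.metadata`); the helper name `describe_package` is hypothetical and not part of pyddq.

```python
import importlib.metadata
import importlib.util
import sys


def describe_package(name):
    """Report whether `name` is importable, where it loads from, and its installed version."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"importable": False, "origin": None, "version": None}
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable but without distribution metadata,
        # e.g. a local file or directory shadowing the real package.
        version = None
    return {"importable": True, "origin": spec.origin, "version": version}


if __name__ == "__main__":
    print(sys.executable)  # the interpreter that pip must install into
    print(describe_package("pyddq"))
```

If `origin` points into your project directory rather than `site-packages`, a local file is probably shadowing the installed package.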

error Py4JJavaError: An error occurred while calling o123.run.

cause Spark session not initialized or incompatible Spark version.

fix

Create a SparkSession before running checks: spark = SparkSession.builder.appName('test').getOrCreate()
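
Since this page lists Spark 2.2.1 as the supported version, it can be worth comparing the installed PySpark version before running any checks. The sketch below assumes the whole 2.2.x line is the safe target; that range and the helper names `parse_major_minor` and `is_supported_spark` are assumptions for illustration, not something pyddq documents.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '2.2.1' or '2.4.0.dev0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported_spark(version, supported=(2, 2)):
    """Return True when `version` falls in the assumed supported major.minor line."""
    return parse_major_minor(version) == supported


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed in this environment")
    else:
        print(pyspark.__version__, is_supported_spark(pyspark.__version__))
```

A `False` result does not prove the library will fail, only that you are outside the version this page names.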

error AttributeError: module 'pyddq' has no attribute 'Runner'

cause Wrong import path; Runner is in the runner submodule.

fix

Import as: from pyddq.runner import Runner

Warnings

gotcha pyddq requires a running SparkSession. If Spark has not been initialized, imports may fail or hang.

fix Ensure SparkSession is created before using pyddq functions.

breaking The API uses the SparkSession entry point introduced in Spark 2.0; older SparkContext-based code will not work.

fix Use SparkSession instead of SparkContext when creating DataFrames.

deprecated The library is no longer actively maintained. There have been no updates since 2017, and it may not work with newer Spark versions.

fix Consider alternatives like great_expectations or Deequ for modern Spark setups.

Imports

Constraint
wrong
```
import pyddq; pyddq.Constraint
```
correct
```
from pyddq import Constraint
```
Direct import from pyddq is the standard method.

Check
wrong
```
from pyddq import Check
```
correct
```
from pyddq.runner import Check
```
The Check class lives in the runner module, not at the top level.

Runner
wrong
```
import pyddq; pyddq.Runner
```
correct
```
from pyddq.runner import Runner
```
Runner must be imported from the runner submodule.
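
The import paths above can be checked against whatever is actually installed rather than taken on trust. The following sketch uses only `importlib` to report whether a module exposes a given name; `exposes` is a hypothetical helper, and the pyddq paths in the usage loop are simply the ones listed in this section.

```python
import importlib


def exposes(module_name, attr):
    """Return True if `module_name` can be imported and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well, since it subclasses ImportError.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    for module_name, attr in [
        ("pyddq", "Constraint"),
        ("pyddq.runner", "Check"),
        ("pyddq.runner", "Runner"),
    ]:
        print(module_name, attr, exposes(module_name, attr))
```

Any line that prints `False` points at an import path that does not match the installed release.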

Quickstart

Create a Spark DataFrame, define a constraint, and run a check with Runner.

```
from pyspark.sql import SparkSession

from pyddq import Constraint
from pyddq.runner import Check, Runner

# pyddq needs a live SparkSession, so create one first.
spark = SparkSession.builder.appName('ddq_example').getOrCreate()
df = spark.createDataFrame([('Alice', 34), ('Bob', 45), ('Charlie', 28)], ['name', 'age'])

# Define a constraint as a SQL condition and attach it to the DataFrame.
constraint = Constraint(name='age_not_null', condition="age IS NOT NULL")
check = Check(df, [constraint])

# Run the check and print the outcome of each constraint.
runner = Runner()
results = runner.run(check)
for r in results:
    print(r.result, r.constraint_name)
```