{"id":1645,"library":"pydeequ","title":"PyDeequ","description":"PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining 'unit tests for data' that measure data quality in large datasets. The current version is 1.5.0, released on April 1, 2025. New releases appear roughly every few months.","status":"active","version":"1.5.0","language":"en","source_language":"en","source_url":"https://github.com/awslabs/python-deequ","tags":["data quality","unit tests","Apache Spark","Python API"],"install":[{"cmd":"pip install pydeequ","lang":"bash","label":"Install PyDeequ"}],"dependencies":[{"reason":"PyDeequ relies on PySpark for distributed data processing.","package":"pyspark","optional":false}],"imports":[{"note":"Import AnalysisRunner from pydeequ.analyzers to run analyzers against a DataFrame.","symbol":"AnalysisRunner","correct":"from pydeequ.analyzers import AnalysisRunner"}],"quickstart":{"code":"import os\n\nimport pyspark\nfrom pyspark.sql import SparkSession, Row\n\n# PyDeequ reads the SPARK_VERSION environment variable on import to select a matching Deequ jar.\n# Assumption: falling back to the installed PySpark version suits your environment.\nos.environ.setdefault('SPARK_VERSION', pyspark.__version__)\n\nimport pydeequ\nfrom pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness\n\n# Set up a Spark session with the Deequ jar on the classpath\nspark = (SparkSession\n    .builder\n    .config('spark.jars.packages', pydeequ.deequ_maven_coord)\n    .config('spark.jars.excludes', pydeequ.f2j_maven_coord)\n    .getOrCreate())\n\n# Sample data; column 'c' contains a null\ndf = spark.sparkContext.parallelize([\n    Row(a='foo', b=1, c=5),\n    Row(a='bar', b=2, c=6),\n    Row(a='baz', b=3, c=None)]).toDF()\n\n# Compute the completeness (fraction of non-null values) of column 'a'\nanalysisResult = (AnalysisRunner(spark)\n    .onData(df)\n    .addAnalyzer(Completeness('a'))\n    .run())\n\n# Convert the metrics to a DataFrame and print them\nAnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()","lang":"python","description":"This script sets up a Spark session, creates a sample DataFrame, computes the completeness of column 'a', and prints the resulting metrics."},"warnings":[{"fix":"Update your code to use Spark Connect for distributed data processing tasks.","message":"PyDeequ 2.0 introduces Spark Connect support, moving away from the Py4J-based JVM bridge. Existing code may need changes to remain compatible.","severity":"breaking","affected_versions":">=2.0.0b1"},{"fix":"Use alternative methods for pattern matching in data validation tasks.","message":"The 'hasPattern' function is deprecated in PyDeequ 1.5.0 and will be removed in a future release.","severity":"deprecated","affected_versions":">=1.5.0"}],"env_vars":null,"last_verified":"2026-04-08T00:00:00.000Z","next_check":"2026-07-07T00:00:00.000Z"}