pyspark-extension


A library providing useful extensions to Apache Spark, including DataFrame diffing, column transformation utilities, Parquet metadata reading, Spark Connect support, and dependency installation helpers. Current version is 2.15.0.4.1; supports Spark 3.2 through 4.0. Release cadence is irregular, with multiple releases per year.

pip install pyspark-extension
error ModuleNotFoundError: No module named 'pyspark_extension'
cause Incorrect import path; the module is named 'spark_extension' not 'pyspark_extension'.
fix Use 'import spark_extension' instead of 'import pyspark_extension'.
error Py4JJavaError: An error occurred while calling o123.diff.
cause Missing Java/Scala jar required for diff operation on certain DataFrame types.
fix Install with the scala extra: pip install pyspark-extension[scala], or add the jar manually via spark.jars.
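Both remedies can be expressed as commands; the jar path below is a placeholder, not a real location:

```shell
# Option 1: install the package together with the bundled Scala/Java jar
# (quotes keep shells like zsh from globbing the brackets)
pip install 'pyspark-extension[scala]'

# Option 2: keep the plain install and hand the jar to Spark yourself
# (replace the path with wherever your spark-extension jar lives)
spark-submit --jars /path/to/spark-extension.jar your_app.py
```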
error AttributeError: module 'spark_extension' has no attribute 'comparators'
cause Trying to import comparators as a top-level attribute.
fix Use 'from spark_extension.comparators import ...' instead of 'spark_extension.comparators'.
breaking Removed support for Spark 3.0 and 3.1 starting from version 2.15.0. Upgrade to Spark 3.2+ or pin pyspark-extension <2.15.0.
fix Use Spark 3.2+ or install pyspark-extension==2.14.2
breaking All undocumented unintended public API parts were made private in version 2.15.0. Any use of internal symbols (e.g., _internal methods) will break.
fix Only use documented public API; check release notes for symbols that became private.
gotcha The Java/Scala jar is required for some features (e.g., diff on DataFrames with complex types). Install with pip install pyspark-extension[scala] or manually add the jar to Spark session.
fix Install with [scala] extra or configure spark.jars
deprecated Backticks handling: In version 2.14.0, columns with special characters are quoted with backticks; columns with only alphanumerics and underscores are no longer quoted. This may break comparisons relying on quoted column names in SQL expressions.
fix Review column name quoting in any custom SQL using column names from this library.
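The 2.14.0 quoting rule described above can be sketched as a small plain-Python helper. This is illustrative only, not part of the library:

```python
import re

def quote_column(name: str) -> str:
    # Mirror of the rule described above (illustrative only):
    # names made of alphanumerics and underscores stay unquoted,
    # anything else is wrapped in backticks.
    if re.fullmatch(r"[A-Za-z0-9_]+", name):
        return name
    return f"`{name}`"

print(quote_column("user_id"))   # -> user_id (no backticks)
print(quote_column("order-id"))  # -> `order-id`
```

If your SQL expressions assume every column name is backtick-quoted, the first case is the one that changed in 2.14.0.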
pip install pyspark-extension[scala]

Basic usage: diff two DataFrames

from pyspark.sql import SparkSession
from spark_extension import diff

spark = SparkSession.builder.appName("test").getOrCreate()

# Two DataFrames sharing the same schema (id, val)
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "val"])

# diff compares the DataFrames row by row and returns a DataFrame
# describing which rows were inserted, deleted, changed, or unchanged
result = diff(df1, df2)
result.show()
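To make the example concrete, here is a plain-Python sketch of what a DataFrame diff conceptually computes. This is illustrative only, not the library's implementation; the letter codes (N/I/D/C for uNchanged/Inserted/Deleted/Changed) follow common diff conventions and are an assumption here:

```python
# Same data as the Spark example above, keyed by id
left = {1: "a", 2: "b"}
right = {1: "a", 3: "c"}

def classify(left, right):
    # Classify each id as Deleted, Inserted, Changed, or uNchanged
    result = {}
    for key in sorted(left.keys() | right.keys()):
        if key not in right:
            result[key] = "D"   # only in left: deleted
        elif key not in left:
            result[key] = "I"   # only in right: inserted
        elif left[key] != right[key]:
            result[key] = "C"   # in both, values differ: changed
        else:
            result[key] = "N"   # in both, values equal: unchanged
    return result

print(classify(left, right))  # {1: 'N', 2: 'D', 3: 'I'}
```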