pyspark-extension
A library providing useful extensions to Apache Spark, including DataFrame diff, column transformation utilities, Parquet metadata reading, Spark Connect support, and dependency installation helpers. The current version is 2.15.0.4.1, supporting Spark 3.2 through 4.0; the release cadence is irregular, with multiple releases per year.
pip install pyspark-extension

Common errors
error: ModuleNotFoundError: No module named 'pyspark_extension'
cause: Incorrect import path; the module is named 'spark_extension', not 'pyspark_extension'.
fix: Use 'import spark_extension' instead of 'import pyspark_extension'.
error: Py4JJavaError: An error occurred while calling o123.diff.
cause: The Java/Scala jar required for diff operations on certain DataFrame types is missing.
fix: Install with the scala extra (pip install pyspark-extension[scala]) or add the jar manually via spark.jars.
error: AttributeError: module 'spark_extension' has no attribute 'comparators'
cause: comparators is a submodule, not a top-level attribute, and is not imported automatically with the parent package.
fix: Use 'from spark_extension.comparators import ...' instead of accessing 'spark_extension.comparators'.
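This is standard Python package behavior rather than anything specific to this library: a submodule does not become an attribute of its parent package until something imports it. A stdlib illustration:

```python
import email

# Importing a package does not eagerly import its submodules, so
# accessing them as attributes can fail until they are imported.
from email.mime.text import MIMEText  # the reliable form

msg = MIMEText("hello")
print(msg.get_content_type())  # text/plain
```

The same applies here: import from spark_extension.comparators explicitly instead of reaching for an attribute on spark_extension.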
Warnings
breaking: Support for Spark 3.0 and 3.1 was removed in version 2.15.0.
fix: Upgrade to Spark 3.2+ or pin pyspark-extension<2.15.0 (e.g. pyspark-extension==2.14.2).
breaking: In version 2.15.0, undocumented, unintentionally public API parts were made private. Any use of internal symbols (e.g. _internal methods) will break.
fix: Use only the documented public API; check the release notes for symbols that became private.
gotcha: The Java/Scala jar is required for some features (e.g. diff on DataFrames with complex types).
fix: Install with the [scala] extra (pip install pyspark-extension[scala]) or add the jar to the Spark session via spark.jars.
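If the jar is added manually rather than via the [scala] extra, the path can be supplied through Spark's standard spark.jars setting, e.g. in spark-defaults.conf (the jar path below is hypothetical):

```
spark.jars  /opt/jars/spark-extension.jar
```

The same value can be passed as --jars to spark-submit or as a .config(...) entry on the SparkSession builder before getOrCreate().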
deprecated: Backtick handling changed in version 2.14.0: columns with special characters are quoted with backticks, while columns containing only alphanumerics and underscores are no longer quoted. This may break comparisons that rely on quoted column names in SQL expressions.
fix: Review column-name quoting in any custom SQL that uses column names produced by this library.
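The 2.14.0 rule can be sketched in plain Python; quote_column is a hypothetical helper written here for illustration, not part of this library:

```python
import re

def quote_column(name: str) -> str:
    # Hypothetical helper mirroring the 2.14.0 behavior described above:
    # only names containing characters outside [A-Za-z0-9_] get backticks.
    if re.fullmatch(r"[A-Za-z0-9_]+", name):
        return name
    return f"`{name}`"

print(quote_column("order_id"))  # order_id  (no longer quoted)
print(quote_column("order id"))  # `order id`
```

Any SQL that previously matched on the quoted form `order_id` would stop matching the unquoted order_id after the change.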
Install
pip install pyspark-extension[scala]

Imports
- diff
  wrong:   from pyspark_extension import diff
  correct: from spark_extension import diff
- comparators
  wrong:   from spark_extension import comparators
  correct: from spark_extension.comparators import default_comparator
- encrypted parquet support
  wrong:   from spark_extension import read_encrypted_parquet
  correct: from spark_extension.parquet import read_encrypted_parquet
Quickstart
from pyspark.sql import SparkSession
from spark_extension import diff

spark = SparkSession.builder.appName("test").getOrCreate()

# Two DataFrames with the same schema
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "val"])

# Compare them row by row
result = diff(df1, df2)
result.show()
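What a keyed diff computes can be sketched in plain Python. This is a conceptual model only, assuming rows are matched on "id"; the library's actual output schema may differ:

```python
def diff_by_key(left: dict, right: dict) -> dict:
    # Conceptual keyed diff: N = unchanged, C = changed,
    # D = only in left (deleted), I = only in right (inserted).
    result = {}
    for key in sorted(left.keys() | right.keys()):
        if key in left and key in right:
            result[key] = "N" if left[key] == right[key] else "C"
        elif key in left:
            result[key] = "D"
        else:
            result[key] = "I"
    return result

# Same data as the quickstart above:
print(diff_by_key({1: "a", 2: "b"}, {1: "a", 3: "c"}))
# {1: 'N', 2: 'D', 3: 'I'}
```

Here row 1 is unchanged, row 2 exists only in the first DataFrame, and row 3 only in the second.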