PySpark Regression

Version 4.2.4 | verified Sat May 09 | auth: no | python

PySpark Regression is a Python library for regression-testing Spark DataFrames: it compares the output of one run or environment against another. The current version is 4.2.4; releases are irregular but ongoing.

pip install pyspark-regression
error ModuleNotFoundError: No module named 'pyspark_regression'
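cause The package is not installed in the active Python environment. The distribution name uses a hyphen (pyspark-regression), while the import name uses an underscore (pyspark_regression).
fix
Install with pip install pyspark-regression, then import with from pyspark_regression import RegressionTest.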
error AttributeError: module 'pyspark_regression' has no attribute 'RegressionTest'
cause The import path is wrong; RegressionTest is exposed at the package top level, not in a submodule.
fix
Import it directly: from pyspark_regression import RegressionTest.
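When either import error above appears, it can help to confirm what the active interpreter actually sees. The following is a minimal sketch using only the standard library; the helper name `describe_install` is hypothetical and not part of pyspark-regression.

```python
from importlib import metadata, util


def describe_install(dist_name: str, module_name: str) -> dict:
    """Report whether a distribution is installed and whether its module is importable."""
    try:
        # Looks up the installed distribution by its pip name (hyphenated here).
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "distribution": dist_name,
        "installed_version": version,
        "module": module_name,
        # find_spec returns None when no top-level module of that name is found.
        "importable": util.find_spec(module_name) is not None,
    }


# pip name uses a hyphen, import name uses an underscore.
print(describe_install("pyspark-regression", "pyspark_regression"))
```

If `installed_version` is None, the package is missing from this environment; if it is set but `importable` is False, the interpreter running the script is probably not the one pip installed into.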
error pyspark.errors.exceptions.capture.SparkException: Found duplicate keys in DataFrame
cause Key column contains non-unique values.
fix
Drop duplicates or use a different key/combination of columns that yields unique rows.
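Before choosing between dropping duplicates and widening the key, it is useful to see which key values repeat. The sketch below uses only standard PySpark DataFrame methods; the helper names `find_duplicate_keys` and `has_unique_keys` are hypothetical and not part of pyspark-regression.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # type hints only; Spark is not imported at module load
    from pyspark.sql import DataFrame


def find_duplicate_keys(df: DataFrame, key_cols: Sequence[str]) -> DataFrame:
    """Return one row per key value that occurs more than once, with its count."""
    # groupBy(...).count() adds a column named "count"; keep only repeated keys.
    return df.groupBy(*key_cols).count().filter("count > 1")


def has_unique_keys(df: DataFrame, key_cols: Sequence[str]) -> bool:
    """True when no key value (or key combination) repeats."""
    # limit(1) avoids counting every duplicate when only a yes/no answer is needed.
    return find_duplicate_keys(df, key_cols).limit(1).count() == 0
```

For example, `find_duplicate_keys(df1, ["id"]).show()` lists the offending ids, and `has_unique_keys(df1, ["id", "val"])` checks whether a candidate composite key would be unique. If the repeats are true duplicates, `df1.dropDuplicates(["id"])` keeps one row per key.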
breaking In version 2.x, the API exposed `compare_dataframes()`; it was removed in 3.0. Use `RegressionTest.compare()` instead.
fix Replace `compare_dataframes(df1, df2, key)` with `RegressionTest().compare(df1, df2, key=key)`.
gotcha The key column must have unique values in each DataFrame; non-unique keys will raise an error.
fix Ensure key column contains unique values, or use additional columns to form a composite key.
gotcha Both DataFrames must have identical column order; differences in column order cause match failures.
fix Use `df.select(*ordered_columns)` before comparing to enforce column order.
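One way to get `ordered_columns` is to take it from the other DataFrame. This is a minimal sketch that relies only on the standard `DataFrame.columns` and `DataFrame.select`; the helper name `align_columns` is hypothetical and not part of pyspark-regression.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type hints only; Spark is not imported at module load
    from pyspark.sql import DataFrame


def align_columns(df: DataFrame, reference: DataFrame) -> DataFrame:
    """Reorder df's columns to match reference, failing fast on a schema mismatch."""
    missing = [c for c in reference.columns if c not in df.columns]
    extra = [c for c in df.columns if c not in reference.columns]
    if missing or extra:
        # Reordering cannot fix a real schema difference, so report it instead.
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return df.select(*reference.columns)
```

Calling `df2 = align_columns(df2, df1)` before the comparison keeps a pure ordering difference from being reported as a mismatch, while a genuinely missing or extra column still surfaces as an error.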
deprecated The `ignore_nulls` parameter is deprecated as of 4.0.0 and will be removed in a future release; use `allow_null_mismatch` instead.
fix Replace `ignore_nulls=True` with `allow_null_mismatch=True`.

Basic usage: compare two Spark DataFrames with a key column.

from pyspark.sql import SparkSession
from pyspark_regression import RegressionTest

spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
df2 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])

tester = RegressionTest()
result = tester.compare(df1, df2, key='id')
print(result)  # df1 and df2 are identical, so this should report no differences