pyspark-nested-functions

0.1.8 verified Fri May 01 auth: no python

Utility functions for manipulating nested structures (arrays and structs) in PySpark DataFrames: dropping, whitelisting, filling nulls (fillna), duplicating, renaming, casting, and adding nested fields. The current version, 0.1.8, supports PySpark 3.1.1 to 4.0 and Python 3.8–3.12. Releases are infrequent, typically a few per year.

pip install pyspark-nested-functions
error AttributeError: module 'pyspark_nested_functions' has no attribute 'drop_multiple_nested_columns'
cause Using an older version (<0.1.0) where the function was named differently or not available.
fix
Upgrade to the latest version: pip install --upgrade pyspark-nested-functions. Check the installed version with pip show pyspark-nested-functions.
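The same version check can be done from inside Python, which helps when the interpreter running Spark is not the one pip reports on. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name is the one used with pip install above.
    found = installed_version("pyspark-nested-functions")
    print(found if found is not None else "not installed")
```

Running this in the same environment as the failing job shows whether the upgrade actually reached that interpreter.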
error TypeError: 'NoneType' object is not subscriptable
cause Trying to add a nested field under a path where an intermediate struct is null.
fix
Use add_nested_field with fillna=True or ensure intermediate structs are not null before adding fields.
error PySpark 4.0 compatibility: Java 17 required
cause PySpark 4.0, which v0.1.8 supports alongside DBR 17.3, requires Java 17 or later; an older JDK on JAVA_HOME will not work.
fix
Set JAVA_HOME to Java 17 or later: export JAVA_HOME=/path/to/jdk-17.
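To confirm which Java the environment actually picks up, the output of java -version can be parsed before starting Spark. The sketch below is an illustration only: the helper names parse_java_major and current_java_major are hypothetical, and it assumes the usual version "…" line that common JDK builds print.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def current_java_major() -> Optional[int]:
    """Run `java -version` (it prints to stderr) and parse the result."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        # No java executable on PATH, or it could not be started.
        return None
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    major = current_java_major()
    if major is None:
        print("Java not found or version not recognised")
    elif major < 17:
        print(f"Java {major} found; PySpark 4.0 needs 17 or later")
    else:
        print(f"Java {major} found")
```

Note that this checks the java on PATH; if JAVA_HOME points elsewhere, Spark may use a different JDK than the one reported here.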
breaking API changes in v0.1.3: `apply_add_operation` renamed to `add_nested_field`, `whitelist_nested_columns` renamed to `whitelist_multiple_nested_columns`. Old names removed.
fix Update calls to use new names: `add_nested_field` and `whitelist_multiple_nested_columns`.
gotcha The library does not validate column paths: invalid or non-existent nested paths may silently produce wrong results or raise obscure exceptions.
fix Always verify column schema before applying transformations; test with small data.
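One way to verify a path up front is to walk the schema before calling any transformation. The sketch below works on the plain dict that PySpark's StructType.jsonValue() returns, so it needs no running Spark session. The function name path_exists and the example schema are hypothetical, and treating arrays as transparent (so that items.price addresses the element struct) is an assumption about how paths are meant to be written.

```python
from typing import Any, Dict, Union

SchemaNode = Union[str, Dict[str, Any]]

# Hand-written example in the shape produced by StructType.jsonValue().
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "struct",
    "fields": [
        {
            "name": "a",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "b", "type": "long", "nullable": True, "metadata": {}},
                    {"name": "c", "type": "long", "nullable": True, "metadata": {}},
                ],
            },
        },
        {
            "name": "items",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "array",
                "containsNull": True,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        {"name": "price", "type": "double", "nullable": True, "metadata": {}},
                    ],
                },
            },
        },
    ],
}


def path_exists(schema: SchemaNode, dotted_path: str) -> bool:
    """Return True if every segment of a dotted path resolves in the schema."""
    node = schema
    for part in dotted_path.split("."):
        # Step through arrays so the path addresses fields of the element type.
        while isinstance(node, dict) and node.get("type") == "array":
            node = node["elementType"]
        # Only structs have named fields; primitives and maps end the walk.
        if not (isinstance(node, dict) and node.get("type") == "struct"):
            return False
        matches = [f for f in node["fields"] if f["name"] == part]
        if not matches:
            return False
        node = matches[0]["type"]
    return True


if __name__ == "__main__":
    for path in ("a.b", "a.x", "items.price"):
        print(path, path_exists(EXAMPLE_SCHEMA, path))
```

With a real DataFrame the call would be path_exists(df.schema.jsonValue(), "a.c"), and a False result is a cue to stop before the transformation runs.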

Demonstrates dropping a nested column and adding a new nested field.

from pyspark.sql import Row, SparkSession
from pyspark_nested_functions import drop_multiple_nested_columns, add_nested_field

spark = SparkSession.builder.appName('example').getOrCreate()
# Build the row from nested Rows so that `a` is inferred as a struct;
# a nested dict would be inferred as a map instead.
df = spark.createDataFrame([Row(a=Row(b=1, c=2))])

# Drop the nested field a.c; struct `a` should then contain only `b`.
df = drop_multiple_nested_columns(df, ["a.c"])
df.show()

# Add a new nested field a.d. The argument order (path, expression, type)
# is as given in this example; check it against the installed version.
df2 = add_nested_field(df, "a.d", "lit(3)", "integer")
df2.show()