pyspark-nested-functions
0.1.8 verified Fri May 01 auth: no python
Utility functions to manipulate nested structures (arrays, structs) in PySpark DataFrames, including drop, whitelist, fillna, duplicate, rename, cast, and add nested fields. Current version 0.1.8 supports PySpark 3.1.1 to 4.0, Python 3.8–3.12. Releases are infrequent, typically a few per year.
pip install pyspark-nested-functions
Common errors
error AttributeError: module 'pyspark_nested_functions' has no attribute 'drop_multiple_nested_columns'
cause Using an older version (<0.1.0) in which the function had a different name or did not exist.
fix Upgrade to the latest version: `pip install --upgrade pyspark-nested-functions`. Check the installed version with `pip show pyspark-nested-functions`.
error TypeError: 'NoneType' object is not subscriptable
cause Adding a nested field under a path where an intermediate struct is null.
fix Use `add_nested_field` with `fillna=True`, or ensure intermediate structs are not null before adding fields.
error PySpark 4.0 compatibility: Java 17 required
cause Library v0.1.8 supports PySpark 4.0 and DBR 17.3, which require Java 17.
fix Set JAVA_HOME to a Java 17 or later installation: `export JAVA_HOME=/path/to/jdk-17`.
Warnings
breaking API changes in v0.1.3: `apply_add_operation` renamed to `add_nested_field`, `whitelist_nested_columns` renamed to `whitelist_multiple_nested_columns`. Old names removed.
fix Update calls to use new names: `add_nested_field` and `whitelist_multiple_nested_columns`.
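Because the old names were removed in v0.1.3, it can help to confirm which version is installed before choosing which names to call. The following is a minimal sketch using only the standard library; `installed_version`, `parse_version`, and `uses_new_names` are hypothetical helper names (not part of the library), and the parsing assumes plain `X.Y.Z` version strings without pre-release suffixes.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' string into a tuple of ints (assumes no suffix)."""
    return tuple(int(part) for part in version.split("."))


def uses_new_names(version: str) -> bool:
    """True if the version is at or after the v0.1.3 rename described above."""
    return parse_version(version) >= (0, 1, 3)


if __name__ == "__main__":
    found = installed_version("pyspark-nested-functions")
    if found is None:
        print("pyspark-nested-functions is not installed")
    else:
        print(found, "new names" if uses_new_names(found) else "old names")
```

Comparing tuples of integers rather than raw strings avoids the case where "0.1.10" would sort before "0.1.3" lexicographically.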
gotcha The library does not validate column paths: invalid or non-existent nested paths may silently produce wrong results or raise obscure exceptions.
fix Always verify column schema before applying transformations; test with small data.
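Since paths are not validated, a small pre-check against the DataFrame schema can catch typos before a transformation runs. The sketch below is a hypothetical helper (`path_exists` is not part of the library) that walks the dictionary form of a schema, as returned by PySpark's `df.schema.jsonValue()`, and reports whether a dotted path exists. It assumes array elements are addressed through the same dotted path as their parent field, which may not match every path convention.

```python
from typing import Any, Dict


def path_exists(schema_json: Dict[str, Any], path: str) -> bool:
    """Check whether a dotted field path exists in a Spark schema given as JSON.

    `schema_json` is the dict form of a StructType, e.g. df.schema.jsonValue().
    Arrays are traversed transparently, so "items.id" matches array<struct<id>>.
    """
    current: Any = schema_json
    for name in path.split("."):
        # Unwrap (possibly nested) arrays to reach their element type.
        while isinstance(current, dict) and current.get("type") == "array":
            current = current["elementType"]
        # Only structs have named fields; primitive types are plain strings.
        if not (isinstance(current, dict) and current.get("type") == "struct"):
            return False
        match = next((f for f in current["fields"] if f["name"] == name), None)
        if match is None:
            return False
        current = match["type"]
    return True


def _field(name: str, dtype: Any) -> Dict[str, Any]:
    """Build one field entry in the shape used by Spark's schema JSON."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


# Hand-written schema shaped like df.schema.jsonValue() for
# struct<a: struct<b: long, c: long>, items: array<struct<id: string>>>.
example_schema = {
    "type": "struct",
    "fields": [
        _field("a", {"type": "struct",
                     "fields": [_field("b", "long"), _field("c", "long")]}),
        _field("items", {"type": "array", "containsNull": True,
                         "elementType": {"type": "struct",
                                         "fields": [_field("id", "string")]}}),
    ],
}

if __name__ == "__main__":
    for candidate in ["a.c", "a.x", "items.id"]:
        print(candidate, path_exists(example_schema, candidate))
```

With a live DataFrame, the same check would be called as `path_exists(df.schema.jsonValue(), "a.c")` before passing `"a.c"` to a transformation.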
Imports
- drop_multiple_nested_columns
wrong: from pyspark_nested_functions.nested_functions import drop_multiple_nested_columns
correct: from pyspark_nested_functions import drop_multiple_nested_columns
- whitelist_multiple_nested_columns
wrong: from pyspark_nested_functions import whitelist_nested_columns
correct: from pyspark_nested_functions import whitelist_multiple_nested_columns
- add_nested_field
wrong: from pyspark_nested_functions import apply_add_operation
correct: from pyspark_nested_functions import add_nested_field
Quickstart
from pyspark.sql import Row, SparkSession
from pyspark_nested_functions import drop_multiple_nested_columns, add_nested_field

spark = SparkSession.builder.appName('example').getOrCreate()

# Build the data from nested Row objects so that `a` is inferred as a struct.
# By default, a nested dict is inferred as a map, which has no fields to drop.
df = spark.createDataFrame([Row(a=Row(b=1, c=2))])

# Drop the nested field a.c, leaving a.b.
df = drop_multiple_nested_columns(df, ["a.c"])
df.printSchema()  # check that a.c no longer appears under a
df.show()

# Add a nested integer field a.d.
df2 = add_nested_field(df, "a.d", "lit(3)", "integer")
df2.show()