SparkAid

1.0.0 · maintenance · verified Thu Apr 16

SparkAid is a Python utility library that simplifies common data manipulation tasks in Apache Spark, particularly for DataFrames with complex, nested schemas. It provides functions for flattening schemas and working with structured types. Releases are infrequent: the latest version, 1.0.0, was released in August 2022.

Common errors

Warnings

Install

Imports

Quickstart

This example initializes a SparkSession, creates a DataFrame with a nested struct column and an array column, and calls `sparkaid.flatten` to unnest the schema. Array columns are handled through the `arrays_to_unpack` argument.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
from sparkaid import flatten

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("SparkAidQuickstart") \
    .master("local[*]") \
    .getOrCreate()

# Create a sample DataFrame with nested structure
data = [
    ("Alice", {"city": "New York", "zip": 10001}, ["apple", "banana"]),
    ("Bob", {"city": "Los Angeles", "zip": 90001}, ["orange"]),
    ("Charlie", None, ["grape", "kiwi", "mango"])
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("zip", IntegerType(), True)
    ]), True),
    StructField("fruits", ArrayType(StringType()), True)
])

df = spark.createDataFrame(data, schema)
print("Original Schema:")
df.printSchema()

print("\nFlattening DataFrame:")
# Flatten the DataFrame. By default, only StructType columns are flattened.
# Since the v1.0.0 breaking change, arrays are unpacked only when requested via
# 'arrays_to_unpack': ["*"] selects every array; here only "fruits" is named.
flattened_df = flatten(df, nested_struct_separator="__", arrays_to_unpack=["fruits"])

print("Flattened Schema:")
flattened_df.printSchema()
print("Flattened Data:")
flattened_df.show()

spark.stop()
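
To make the column naming in the quickstart concrete, the following is a minimal pure-Python sketch of how nested field names can be joined with a separator such as `"__"`. It is not SparkAid's implementation and does not use Spark: the `flattened_names` helper and the dict-based schema description are hypothetical, and the `parent__child` naming is an assumption inferred from the `nested_struct_separator` argument. Array unpacking is not modeled here.

```python
from typing import Any, Dict, List


def flattened_names(schema: Dict[str, Any], separator: str = "__", prefix: str = "") -> List[str]:
    """Return leaf column names of a nested schema, joined by `separator`.

    `schema` is a plain-dict stand-in for a Spark StructType (hypothetical,
    for illustration only): a dict value is a struct, anything else is a leaf.
    """
    names: List[str] = []
    for field, field_type in schema.items():
        full_name = f"{prefix}{separator}{field}" if prefix else field
        if isinstance(field_type, dict):
            # Struct: recurse, carrying the parent name as the prefix.
            names.extend(flattened_names(field_type, separator, full_name))
        else:
            # Leaf, including arrays, which this sketch leaves untouched.
            names.append(full_name)
    return names


# Mirrors the shape of the quickstart schema above.
quickstart_schema = {
    "name": "string",
    "address": {"city": "string", "zip": "int"},
    "fruits": "array<string>",
}

print(flattened_names(quickstart_schema))
# ['name', 'address__city', 'address__zip', 'fruits']
```

Under that naming assumption, comparing this list with the output of `flattened_df.printSchema()` is a quick way to check that the struct fields were flattened as expected.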
