Sparkdantic

2.8.0 · active · verified Sat Apr 11

Sparkdantic is a Python library that bridges Pydantic models with PySpark schemas. It lets developers define data structures with Pydantic, then automatically generate equivalent `pyspark.sql.types.StructType` schemas for use with Spark DataFrames. This simplifies data validation and schema management across Python applications and Spark environments. The library maintains an active release cadence, with frequent updates tracking Pydantic and PySpark compatibility.

Warnings

Install
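Sparkdantic is distributed on PyPI, so a typical install is via pip. PySpark is also needed at runtime; the explicit `pyspark` install below is a safeguard in case your environment does not already provide it:

```shell
# Install Sparkdantic from PyPI
pip install sparkdantic

# Ensure PySpark is available (skip if your cluster/environment already ships it)
pip install pyspark
```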

Imports
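The quickstart below relies on these imports; `create_spark_schema` is the only name needed from Sparkdantic itself, while the rest come from the standard library, Pydantic, and PySpark:

```python
from typing import Optional, List            # field typing for the Pydantic model
from pydantic import BaseModel               # base class for model definitions
from sparkdantic import create_spark_schema  # Pydantic model -> Spark StructType
from pyspark.sql import SparkSession         # only needed to build a DataFrame
```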

Quickstart

This quickstart demonstrates how to define a Pydantic model and use `create_spark_schema` to generate a corresponding PySpark `StructType`. It then shows how to initialize a SparkSession and create an empty DataFrame using the generated schema, printing both the raw schema and the DataFrame's schema for verification.

from typing import Optional, List
from pydantic import BaseModel
from sparkdantic import create_spark_schema
from pyspark.sql import SparkSession

# Define your Pydantic model
class Product(BaseModel):
    product_id: int
    name: str
    price: float
    description: Optional[str] = None
    tags: List[str] = []

# Generate the Spark schema from the Pydantic model
spark_schema = create_spark_schema(Product)

# Print the generated Spark schema (useful for verification)
print("Generated Spark Schema:")
print(spark_schema)

# Optionally, use it with a Spark DataFrame
spark = SparkSession.builder.appName("SparkdanticExample").getOrCreate()

# Create an empty DataFrame with the defined schema
df = spark.createDataFrame([], schema=spark_schema)

print("\nDataFrame created with generated schema:")
df.printSchema()

spark.stop()
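Beyond an empty DataFrame, model instances can be turned into rows for Spark: Pydantic validates each record on construction, and (assuming Pydantic v2) `model_dump()` yields plain dicts that `createDataFrame` accepts alongside the generated schema. A minimal sketch, with the Spark call shown commented since it assumes a live SparkSession and the `spark_schema` from the quickstart:

```python
from typing import Optional, List
from pydantic import BaseModel

class Product(BaseModel):
    product_id: int
    name: str
    price: float
    description: Optional[str] = None
    tags: List[str] = []

# Pydantic validates each record on construction...
products = [
    Product(product_id=1, name="widget", price=9.99, tags=["new"]),
    Product(product_id=2, name="gadget", price=19.5),
]

# ...and model_dump() produces plain dicts that Spark accepts as rows.
rows = [p.model_dump() for p in products]
print(rows[0]["name"])  # widget

# With a running SparkSession and the schema generated by create_spark_schema(Product):
# df = spark.createDataFrame(rows, schema=spark_schema)
```

Pairing validated rows with the generated schema keeps one source of truth: the Pydantic model defines both the runtime validation and the Spark column types.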
