pydantic-spark

1.0.1 · verified Mon Apr 27 · python

Converts Pydantic models to PySpark schemas. Current version 1.0.1 supports Pydantic v2; release cadence is irregular. Designed for data-engineering pipelines where Pydantic models define data contracts and the matching Spark schemas must be derived from them.

pip install pydantic-spark
error ImportError: cannot import name 'to_spark_schema' from 'pydantic_spark'
cause The installed version (<0.3.0) predates the to_spark_schema API, or the import path is misspelled.
fix
Upgrade to latest version: pip install --upgrade pydantic-spark. Use correct import: from pydantic_spark import to_spark_schema
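Before upgrading blindly, it can help to confirm which version is actually installed; a stdlib-only sketch (the 0.3.0 cutoff is taken from the cause above):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist: str):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None

# to_spark_schema ships only in pydantic-spark >= 0.3.0
v = installed_version("pydantic-spark")
if v is None or tuple(int(p) for p in v.split(".")[:2]) < (0, 3):
    print("run: pip install --upgrade pydantic-spark")
```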
error pyspark.sql.utils.AnalysisException: u'Unable to infer schema for type. It must be specified manually.;'
cause The generated schema is incomplete or wrong for complex Pydantic models (e.g., Union types).
fix
Simplify the model to avoid Union/Optional fields, or bypass inference entirely by passing an explicit PySpark StructType.
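If inference keeps failing, one workaround is to hand-write the schema in Spark's JSON schema representation and load it with StructType.fromJson. A stdlib-only sketch (the field names are illustrative):

```python
import json

# Spark's JSON schema representation. StructType.fromJson can load this dict,
# so the schema never has to be inferred from the Pydantic model.
explicit_schema = {
    "type": "struct",
    "fields": [
        {"name": "name", "type": "string", "nullable": False, "metadata": {}},
        # A field that was Optional[int] in the model: mark it nullable instead
        {"name": "age", "type": "integer", "nullable": True, "metadata": {}},
    ],
}
print(json.dumps(explicit_schema))

# With pyspark available:
#   from pyspark.sql.types import StructType
#   schema = StructType.fromJson(explicit_schema)
#   df = spark.createDataFrame(rows, schema)
```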
error AttributeError: module 'pydantic_spark' has no attribute 'to_spark_schema'
cause The function was renamed or moved in version 1.0.0.
fix
Use from pydantic_spark import to_spark_schema directly. If using v0.x, use from pydantic_spark.converter import to_spark_schema.
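An import that tolerates both layouts can be sketched with a fallback, assuming the v0.x module path named in the fix above:

```python
# Try the v1.x import first, then fall back to the v0.x module path.
try:
    from pydantic_spark import to_spark_schema
except ImportError:
    try:
        from pydantic_spark.converter import to_spark_schema
    except ImportError:
        to_spark_schema = None  # pydantic-spark not installed at all
```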
breaking Version 1.0.0 dropped support for Pydantic v1. If upgrading from v0.3.0 or earlier, you must migrate your models to Pydantic v2.
fix Update Pydantic to v2 and follow their migration guide (https://docs.pydantic.dev/latest/migration/)
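As a minimal illustration of the v1 → v2 changes that matter here (assuming Pydantic v2 is installed): @validator becomes @field_validator, and class Config becomes model_config:

```python
from pydantic import BaseModel, field_validator

class User(BaseModel):
    # v1: @validator("name")      -> v2: @field_validator("name")
    # v1: class Config: ...       -> v2: model_config = ConfigDict(...)
    name: str

    @field_validator("name")
    @classmethod
    def strip_name(cls, v: str) -> str:
        # Trim surrounding whitespace during validation
        return v.strip()

print(User(name="  alice ").name)
```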
gotcha Complex nested types (e.g., models with Union, Optional, or recursive references) may produce unexpected Spark types. Manual schema adjustments might be needed.
fix Inspect the generated schema and override field types using Pydantic v2's Field(..., json_schema_extra={...}) (the v1 name was schema_extra) or custom serialization.
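A sketch of attaching extra metadata to a field via Pydantic v2's json_schema_extra; the spark_type key here is hypothetical, so check how your pydantic-spark version actually consumes field extras before relying on it:

```python
from pydantic import BaseModel, Field

class Event(BaseModel):
    # "spark_type" is a hypothetical metadata key for illustration only;
    # json_schema_extra merges this dict into the field's JSON schema.
    payload: str = Field(json_schema_extra={"spark_type": "string"})

print(Event.model_json_schema()["properties"]["payload"])
```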
deprecated The 'coerce' feature (CoerceType) from v0.3.0 is deprecated in v1.0.0+. Use Pydantic's built-in validators instead.
fix Remove usage of CoerceType and replace with @field_validator or @model_validator in your Pydantic model.

Basic usage: define a Pydantic model and convert to Spark schema.

from pyspark.sql import SparkSession
from pydantic import BaseModel
from pydantic_spark import to_spark_schema

class MyModel(BaseModel):
    name: str
    age: int

spark = SparkSession.builder.getOrCreate()

# Derive the Spark schema from the Pydantic model
schema = to_spark_schema(MyModel)

# Build an empty DataFrame with the derived schema and show it
df = spark.createDataFrame([], schema)
print(df.schema)

spark.stop()