{"id":4262,"library":"sparkdantic","title":"Sparkdantic","description":"Sparkdantic is a Python library that bridges Pydantic models with PySpark schemas. It allows developers to define data structures using Pydantic, then automatically generate equivalent `pyspark.sql.types.StructType` schemas for use in Spark DataFrames. This simplifies data validation and schema management across Python applications and Spark environments. The current version is 2.8.0, and the project maintains an active release cadence, updating frequently for Pydantic and PySpark compatibility.","status":"active","version":"2.8.0","language":"en","source_language":"en","source_url":"https://github.com/sparkdantic/sparkdantic","tags":["pyspark","pydantic","schema generation","data validation","etl"],"install":[{"cmd":"pip install sparkdantic pyspark","lang":"bash","label":"Install sparkdantic and pyspark"}],"dependencies":[{"reason":"Core dependency for defining data models. Supports Pydantic v1 and v2.","package":"pydantic","optional":false},{"reason":"Required for Spark schema generation and integration with Spark DataFrames.","package":"pyspark","optional":false}],"imports":[{"symbol":"create_spark_schema","correct":"from sparkdantic import create_spark_schema"},{"note":"Pydantic's BaseModel is used to define the source models for schema generation.","symbol":"BaseModel","correct":"from pydantic import BaseModel"}],"quickstart":{"code":"from typing import List, Optional\nfrom pydantic import BaseModel\nfrom pyspark.sql import SparkSession\nfrom sparkdantic import create_spark_schema\n\n# Define your Pydantic model\nclass Product(BaseModel):\n    product_id: int\n    name: str\n    price: float\n    description: Optional[str] = None\n    tags: List[str] = []\n\n# Generate the Spark schema from the Pydantic model\nspark_schema = create_spark_schema(Product)\n\n# Print the generated Spark schema (useful for verification)\nprint(\"Generated Spark Schema:\")\nprint(spark_schema)\n\n# Optionally, use it with a Spark DataFrame\nspark = SparkSession.builder.appName(\"SparkdanticExample\").getOrCreate()\n\n# Create an empty DataFrame with the defined schema\ndf = spark.createDataFrame([], schema=spark_schema)\n\nprint(\"\\nDataFrame created with generated schema:\")\ndf.printSchema()\n\nspark.stop()","lang":"python","description":"This quickstart demonstrates how to define a Pydantic model and use `create_spark_schema` to generate a corresponding PySpark `StructType`. It then initializes a SparkSession and creates an empty DataFrame from the generated schema, printing both the raw schema and the DataFrame's schema for verification."},"warnings":[{"fix":"Upgrade sparkdantic to 2.0.0 or newer, quoting the specifier so the shell does not treat `>` as redirection (e.g., `pip install \"sparkdantic>=2.0.0\"`). The current version 2.8.0 fully supports Pydantic V1 and V2.","message":"Older versions of sparkdantic (< 2.0.0) do not support Pydantic V2. If you are upgrading your Pydantic dependency to V2, ensure you also upgrade sparkdantic to version 2.0.0 or higher.","severity":"breaking","affected_versions":"<2.0.0"},{"fix":"Be aware of this automatic mapping. For stricter enum validation in Spark, apply UDFs or explicit casts to the string values after DataFrame creation.","message":"Pydantic `Enum` types are automatically mapped to `StringType` in Spark schemas. If you require more specific type handling or validation for enums within Spark, you may need to implement custom logic after schema generation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Store UUIDs as strings in Spark DataFrames. If you need to work with UUID values directly in Spark, parse the string representation inside a UDF.","message":"Pydantic `uuid.UUID` types are automatically converted to `StringType` in the generated Spark schema. Spark does not have a native UUID type, so string representation is the default and most common approach.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}