Pydantic to PyArrow Schema Conversion
pydantic-to-pyarrow is a Python library (current version 0.1.6) that converts Pydantic models into PyArrow (Apache Arrow) schemas. It streamlines data processing pipelines: records are validated with Pydantic, then converted to a columnar format for efficient processing with PyArrow, Pandas, or Polars, and for storage in formats such as Parquet. The library is actively maintained with regular feature releases.
Common errors
- A module that was compiled using NumPy 1.x cannot be run in Numpy 2.x. This file was compiled with numpy 1.x and is trying to run with numpy 2.x.
  cause: Incompatibility between an older PyArrow version (pre-15.0) and a newer NumPy version (2.x).
  fix: Upgrade `pyarrow` to version 15.0 or higher: `pip install --upgrade pyarrow`. If an upgrade is not possible, downgrade `numpy` to a 1.x version instead: `pip install "numpy<2"`.
- ERROR: Failed building wheel for pyarrow
  ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects
  cause: Often occurs when `pyarrow` is installed on a Python version for which pre-built wheels are not yet available (e.g., a very new Python release), so pip falls back to building from source.
  fix: Check the PyArrow documentation for supported Python versions. Use a slightly older, supported Python version, or wait for PyArrow to release wheels for your Python version. Installing build dependencies (e.g., `pip install cython setuptools wheel`) sometimes helps, but a missing wheel for the specific Python version is usually the root cause.
- TypeError: Converting Pydantic type to Arrow Type: unsupported type <some_type>
  cause: The Pydantic model contains a Python type (e.g., a custom type, or a standard library type not yet explicitly supported) for which `pydantic-to-pyarrow` has no defined conversion to a PyArrow type.
  fix: Review the `pydantic-to-pyarrow` documentation or source for supported type conversions. If your type is not supported, transform it to a compatible type within your Pydantic model (e.g., convert a custom object to a `str` or `dict`), or contribute support to the library. For Enums, ensure the `pydantic-to-pyarrow` version is at least 0.1.2.
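The first error above comes down to a version pairing, which can be checked before importing either package. The following is a minimal sketch using only the standard library; the helper names `major_version`, `has_numpy2_conflict`, and `installed_version` are hypothetical and not part of `pydantic-to-pyarrow`, and the 15.0 threshold is taken from the description above.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '15.0.2'."""
    return int(version.split(".")[0])


def has_numpy2_conflict(pyarrow_version: str, numpy_version: str) -> bool:
    """True when PyArrow is older than 15 while NumPy is 2.x or newer."""
    return major_version(pyarrow_version) < 15 and major_version(numpy_version) >= 2


def installed_version(package: str) -> Optional[str]:
    """Look up an installed distribution's version, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Check the current environment; both packages may be missing.
pa_version = installed_version("pyarrow")
np_version = installed_version("numpy")
if pa_version and np_version and has_numpy2_conflict(pa_version, np_version):
    print(f"pyarrow {pa_version} predates 15.0 but numpy {np_version} is 2.x")
```

If the check reports a conflict, the fix listed above applies: upgrade `pyarrow` or pin `numpy<2`.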
Warnings
- gotcha PyArrow versions less than 15.0 are incompatible with NumPy 2.x, which can lead to runtime errors (e.g., 'A module that was compiled using NumPy 1.x cannot be run in Numpy 2.x').
- gotcha Python's `int` type is unbounded, but PyArrow's `pa.int64()` has a fixed maximum. Large Python integers may overflow when converted, leading to data loss or unexpected values.
- gotcha When creating PyArrow tables from Pydantic models that include `UUID` fields, especially with PyArrow 19.0+, `pa.Table.from_pylist` expects bytes, not `UUID` objects directly. This requires adding a serializer to your Pydantic model to convert UUIDs to bytes.
- gotcha By default, converting timezone-aware Python datetimes will raise an exception to prevent loss of timezone information. The generated PyArrow schema will use `timestamp[ns]` without timezone.
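The integer, UUID, and datetime warnings above can each be handled by normalizing values before they reach PyArrow. The sketch below uses only the standard library; the helpers `fits_int64`, `uuid_to_bytes`, and `to_naive_utc` are hypothetical names for illustration, not part of `pydantic-to-pyarrow`. Wiring them into Pydantic validators or serializers is left to the model author, and whether dropping the timezone is acceptable depends on your data.

```python
from datetime import datetime, timedelta, timezone
from uuid import UUID

# Bounds of a signed 64-bit integer, the range pa.int64() can hold.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Return True if the Python int is representable as a signed 64-bit value."""
    return INT64_MIN <= value <= INT64_MAX


def uuid_to_bytes(value: UUID) -> bytes:
    """Return the 16-byte big-endian form of a UUID."""
    return value.bytes


def to_naive_utc(value: datetime) -> datetime:
    """Convert an aware datetime to UTC and drop tzinfo; pass naive values through."""
    if value.tzinfo is None:
        return value
    return value.astimezone(timezone.utc).replace(tzinfo=None)


# 12:00 at UTC+02:00 is 10:00 in UTC.
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_naive_utc(aware))  # 2024-01-01 10:00:00
print(fits_int64(2**63))    # False
```

Converting to UTC before discarding `tzinfo` keeps the instant in time unambiguous even though the stored value no longer carries an offset.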
Install
- pip install pydantic-to-pyarrow
Imports
- get_pyarrow_schema
from pydantic_to_pyarrow import get_pyarrow_schema
Quickstart
import pyarrow as pa
from pydantic import BaseModel
from typing import List, Optional
from datetime import datetime
from uuid import UUID
from pydantic_to_pyarrow import get_pyarrow_schema

class Address(BaseModel):
    street: str
    zip_code: int


class Person(BaseModel):
    name: str
    age: int
    height_cm: Optional[float]
    is_active: bool
    created_at: datetime
    uuid_id: UUID
    tags: List[str] = []
    address: Address

# Convert the Pydantic model to a PyArrow Schema
arrow_schema = get_pyarrow_schema(Person)
print(arrow_schema)
# Expected output (order of fields may vary slightly depending on Pydantic version):
# name: string
# age: int64
# height_cm: double
# is_active: bool
# created_at: timestamp[ns]
# uuid_id: fixed_size_binary[16]
# tags: list<item: string>
#   child 0, item: string
# address: struct<street: string, zip_code: int64>
#   child 0, street: string
#   child 1, zip_code: int64