Pandera Data Validation
Pandera is a lightweight, flexible open-source Python library for validating and testing statistical data objects such as Pandas DataFrames and Series. Users define schema objects that validate the structure, types, and values of their data, so data-quality problems are caught at the validation step rather than surfacing later as unexpected errors. The library is actively maintained: version 0.30.1 is the current release, and minor releases are frequent.
Warnings
- breaking Pandera 0.27.0 dropped support for Python 3.9. Users on Python 3.9 must pin an older Pandera version.
- breaking Pandera 0.27.0 has a regression when paired with `numpy==2.4.0`: users may encounter a `ValueError` related to type recognition. Version 0.27.1 fixes it.
- gotcha Pandera 0.30.0 added support for Pandas >=3.0. Older Pandera versions (<0.30.0) are not compatible with Pandas 3.0 and will likely fail.
- gotcha To use Pandera with alternative DataFrame backends such as Polars or PySpark, install the matching extra (e.g., `pip install pandera[polars]`). The core installation only supports Pandas.
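The version constraints above can be turned into a small pre-flight check. The following is a minimal sketch, not part of Pandera: `parse_version`, `compatibility_problems`, and `installed_problems` are hypothetical helper names, and the rules encode only the Pandas and NumPy warnings listed in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.30.1' into (0, 30, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '3.0' so tuple comparisons behave.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def compatibility_problems(pandera_v, pandas_v, numpy_v):
    """Return human-readable problems for a given version combination."""
    problems = []
    pandera_t = parse_version(pandera_v)
    # Rule from the warnings: Pandas 3.0 support arrived in Pandera 0.30.0.
    if pandera_t < (0, 30, 0) and parse_version(pandas_v) >= (3, 0, 0):
        problems.append("pandas >=3.0 needs pandera >=0.30.0")
    # Rule from the warnings: 0.27.0 regressed with numpy 2.4.0, fixed in 0.27.1.
    if pandera_t == (0, 27, 0) and parse_version(numpy_v) == (2, 4, 0):
        problems.append("pandera 0.27.0 with numpy 2.4.0: upgrade to pandera >=0.27.1")
    return problems


def installed_problems():
    """Apply the same rules to whatever is installed in this environment."""
    try:
        return compatibility_problems(
            version("pandera"), version("pandas"), version("numpy")
        )
    except PackageNotFoundError as exc:
        return [f"package not installed: {exc}"]
```

Calling `installed_problems()` at application start-up would report a bad combination before any schema is built. The rule set is deliberately small and would need extending as new releases appear.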
Install
- pip install pandera
- pip install pandera[polars]
- pip install pandera[pyspark]
Imports
- pandera
import pandera as pa
- DataFrameSchema
from pandera import DataFrameSchema
- Column
from pandera import Column
- Check
from pandera import Check
- SeriesSchema
from pandera import SeriesSchema
Quickstart
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check
# 1. Define a DataFrameSchema
schema = DataFrameSchema(
    columns={
        "id": Column(int, Check.greater_than_or_equal_to(0)),
        "name": Column(str, Check.str_matches(r"^[A-Za-z]+$")),
        "value": Column(float, Check.in_range(0.0, 1.0)),
    },
    # Optionally validate the index. The DataFrames below use the default,
    # unnamed RangeIndex, so no index name is required here.
    index=pa.Index(int),
    # Ensure no extra columns exist
    strict=True,
)
# 2. Create a valid DataFrame
valid_df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "value": [0.1, 0.5, 0.9],
})
# 3. Validate the DataFrame
try:
    # lazy=True collects every failure and raises SchemaErrors (plural)
    validated_df = schema.validate(valid_df, lazy=True)
    print("Valid DataFrame validated successfully:")
    print(validated_df)
except pa.errors.SchemaErrors as e:
    print(f"Validation failed unexpectedly for valid data: {e}")
# 4. Create an invalid DataFrame to demonstrate error handling
invalid_df = pd.DataFrame({
    "id": [-1, 2, 3],                      # Fails 'greater_than_or_equal_to(0)'
    "name": ["Alice", "Bob1", "Charlie"],  # Fails 'str_matches'
    "value": [0.1, 0.5, 1.5],              # Fails 'in_range'
    "extra_col": [1, 2, 3],                # Fails 'strict=True'
})
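try:
    # Without lazy=True, validation stops at the first failure and raises
    # SchemaError (singular), which the handler below would not catch.
    schema.validate(invalid_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print("\nInvalid DataFrame caught by schema errors:")
    # failure_cases is a DataFrame with one row per failing value or check
    print(e.failure_cases)
    print(f"Total errors: {len(e.failure_cases)}")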