Pandera Data Validation

0.30.1 · active · verified Thu Apr 09

Pandera is a lightweight, flexible open-source Python library for validating and testing statistical data objects such as pandas DataFrames and Series. Users define schema objects that validate the structure, types, and values of data, which catches data-quality problems before they surface as unexpected errors downstream. The library is actively maintained and ships minor releases frequently; the current version is 0.30.1.

Warnings

Install

Imports

Quickstart

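This quickstart defines a `DataFrameSchema` with column and index constraints, then validates both a valid and an invalid pandas DataFrame. It also shows how to catch `SchemaErrors`, the exception raised when validation runs with `lazy=True` and collects every failure instead of stopping at the first, and how to inspect its `failure_cases` DataFrame.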

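import pandas as pd
# pandera.pandas is the entry point for the pandas API; importing
# pandas-specific classes from top-level `pandera` is deprecated.
import pandera.pandas as pa
from pandera.pandas import Column, DataFrameSchema, Check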

# 1. Define a DataFrameSchema
schema = DataFrameSchema(
    columns={
        "id": Column(int, Check.greater_than_or_equal_to(0)),
        "name": Column(str, Check.str_matches(r"^[A-Za-z]+$")),
        "value": Column(float, Check.in_range(0.0, 1.0))
    },
    # Optionally validate the index dtype. No name is required here because
    # the DataFrames below use the default, unnamed integer index.
    index=pa.Index(int),
    # Ensure no extra columns exist
    strict=True
)

# 2. Create a valid DataFrame
valid_df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "value": [0.1, 0.5, 0.9]
})

# 3. Validate the DataFrame
try:
    validated_df = schema.validate(valid_df)
    print("Valid DataFrame validated successfully:")
    print(validated_df)
except pa.errors.SchemaError as e:  # eager validation raises SchemaError, not SchemaErrors
    print(f"Validation failed unexpectedly for valid data: {e}")

# 4. Create an invalid DataFrame to demonstrate error handling
invalid_df = pd.DataFrame({
    "id": [-1, 2, 3], # Fails 'greater_than_or_equal_to(0)'
    "name": ["Alice", "Bob1", "Charlie"], # Fails 'str_matches'
    "value": [0.1, 0.5, 1.5], # Fails 'in_range'
    "extra_col": [1, 2, 3] # Fails 'strict=True'
})

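try:
    # lazy=True collects every failure and raises SchemaErrors;
    # the default (eager) mode raises SchemaError at the first failure.
    schema.validate(invalid_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print("\nInvalid DataFrame caught by schema errors:")
    print(e.failure_cases)
    print(f"Total errors: {len(e.failure_cases)}")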
