Patsy
Patsy is a Python package for describing statistical models and building design matrices; it brings R-style formulas to Python. The current version is 1.0.2. No new features are planned, but the project still receives maintenance releases to stay compatible with current releases in the Python ecosystem.
Warnings
- breaking Python 2.7 support was dropped in version 1.0.0. Projects still on Python 2 must use an older version of Patsy.
- gotcha Patsy automatically adds an intercept term and uses treatment coding for categorical variables, dropping one reference level. To omit the intercept, add `- 1` (or `+ 0`) to the formula, e.g. `y ~ x1 + C(a) - 1`; without an intercept, the first categorical term is coded with one column per level.
- gotcha `NA_action='drop'` is the default for `dmatrix` and `dmatrices`, so rows containing any missing values are silently dropped. This can cause unexpected data loss; pass `NA_action='raise'` to get an error instead.
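- gotcha Operators such as `+`, `*`, and `**` have formula meanings inside Patsy formulas. In particular, `**` expands to interaction terms up to the given order; it is not Python's power operator. Wrap arithmetic in `I()` (the identity function) to force Python's interpretation (e.g., `I(x**2)`).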
- gotcha Patsy fixed compatibility issues with `numpy >= 2` in version 1.0.0. Older versions might not work correctly with newer NumPy.
- gotcha Patsy fixed compatibility with Pandas 3's new `StringDtype` in version 1.0.2. Older versions may encounter issues with Pandas 3.
- deprecated The project describes itself as 'no longer under active development' with respect to new features and names 'Formulaic' as its spiritual successor. For new projects, consider Formulaic.
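The intercept, missing-value, and operator gotchas listed above can be checked with a minimal sketch. The data here is made up for illustration (the names `a`, `x`, `z`, and `with_nan` are assumptions, not part of Patsy); only `dmatrix`, `PatsyError`, `NA_action`, and `I()` come from the library.

```python
import numpy as np
from patsy import dmatrix, PatsyError

# Illustrative data: one categorical column and two numeric columns
data = {
    "a": ["a1", "a2", "a3", "a1", "a2", "a3"],
    "x": np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    "z": np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5]),
}

# Default: intercept plus treatment coding (reference level a1 is dropped)
with_intercept = dmatrix("a + x", data)
print(with_intercept.design_info.column_names)
# ['Intercept', 'a[T.a2]', 'a[T.a3]', 'x']

# "- 1" removes the intercept; a now gets one column per level
no_intercept = dmatrix("a + x - 1", data)
print(no_intercept.design_info.column_names)
# ['a[a1]', 'a[a2]', 'a[a3]', 'x']

# Missing values: the default NA_action='drop' silently removes the NaN row
with_nan = {"x": np.array([1.0, np.nan, 3.0])}
dropped = dmatrix("x", with_nan)
print(dropped.shape[0])  # 2 rows remain out of 3

# NA_action='raise' turns the silent drop into an error
try:
    dmatrix("x", with_nan, NA_action="raise")
    raised = False
except PatsyError:
    raised = True

# I() protects Python arithmetic: the second column holds x squared
squared = dmatrix("I(x**2)", data)

# Without I(), ** expands to main effects plus interactions
pairwise = dmatrix("(x + z)**2", data)
print(pairwise.design_info.column_names)
# ['Intercept', 'x', 'z', 'x:z']
```

Comparing the row count of the design matrix with the length of the input is a cheap way to notice dropped rows when `NA_action='raise'` is too strict for a given workflow.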
Install
-
pip install patsy
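Because several of the warnings above depend on the installed version, it can help to check it after installing. A minimal sketch, assuming the version string starts with numeric `major.minor` components; the `supports_numpy2` name is hypothetical and the 1.0 threshold is taken from the warning about `numpy >= 2`.

```python
import patsy

# Parse "major.minor" from the installed version string, e.g. "1.0.2" -> (1, 0)
major, minor = (int(part) for part in patsy.__version__.split(".")[:2])

# Threshold from the warnings: NumPy 2 compatibility was fixed in 1.0.0
supports_numpy2 = (major, minor) >= (1, 0)
print(patsy.__version__, supports_numpy2)
```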
Imports
- dmatrix
from patsy import dmatrix
- dmatrices
from patsy import dmatrices
- demo_data
from patsy import demo_data
Quickstart
from patsy import dmatrices, demo_data
# Create example data: a and b are categorical; x1, x2, and y are numeric
data = demo_data("a", "b", "x1", "x2", "y")
# Build the outcome (y) and predictor (X) design matrices for a linear model
y, X = dmatrices("y ~ x1 + x2 + a", data=data)
print("Dependent variable (y):")
print(y)
print("\nIndependent variables (X):")
print(X)
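As a follow-up to the quickstart, the sketch below shows one way to inspect the resulting matrix and reuse its encoding on new rows. It relies on `design_info` and `build_design_matrices` from Patsy; the `new_data` values are made up for illustration.

```python
import numpy as np
from patsy import dmatrices, demo_data, build_design_matrices

data = demo_data("a", "b", "x1", "x2", "y")
y, X = dmatrices("y ~ x1 + x2 + a", data=data)

# Column names record how each term was encoded
columns = X.design_info.column_names
print(columns)

# Design matrices convert to plain NumPy arrays for use with other libraries
X_array = np.asarray(X)
print(X_array.shape)

# Apply the same encoding to new rows (illustrative values, not real data)
new_data = {
    "a": ["a1", "a2"],
    "x1": np.array([0.0, 1.0]),
    "x2": np.array([1.0, -1.0]),
}
(new_X,) = build_design_matrices([X.design_info], new_data)
print(np.asarray(new_X))
```

Reusing `X.design_info` rather than calling `dmatrices` again keeps the categorical levels and column layout identical between the original and the new data.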