Patsy

1.0.2 | verified Tue May 12 | auth: no | python | install: verified | quickstart: verified | maintenance

Patsy is a Python package for describing statistical models and for building design matrices, bringing R-style formulas to Python. The current version is 1.0.2. No new feature development is planned, but the project still receives maintenance releases to stay compatible with current releases in the Python ecosystem.

pip install patsy
error ModuleNotFoundError: No module named 'patsy'
cause The `patsy` library is not installed in the Python environment where you are trying to use it, or the environment is not correctly activated.
fix
Install patsy using pip or conda: pip install patsy or conda install patsy.
error patsy.PatsyError: Error evaluating factor: TypeError: 'int' object is not callable b ~ C(a)
cause This error occurs when a variable named `C` (or another built-in Patsy function name like `I` or `Q`) exists in your global or local Python namespace and conflicts with Patsy's attempt to use its own built-in `C()` function for categorical variables.
fix
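Rename your conflicting variable (e.g., change C = 1 to my_C = 1). Alternatively, evaluate the formula in an environment that does not contain the conflicting name. Note that eval_env=0 is the default and refers to the caller's namespace, which is exactly where the conflict lives, so it does not help. statsmodels formula functions accept eval_env=-1 to request a clean environment: smf.ols('b ~ C(a)', data=df, eval_env=-1).fit() (with statsmodels.formula.api imported as smf). With patsy.dmatrix, pass an explicit patsy.EvalEnvironment object instead. If the conflicting name is a column of the data itself, rename that column.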
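The shadowing described in this entry can be reproduced in a few lines. The following is a minimal sketch, assuming a small hypothetical data dictionary and two illustrative helper functions (`build_shadowed`, `build_clean`) that are not part of Patsy. It triggers the error from a local `C`, then builds the same matrix with an explicit, empty `EvalEnvironment` so the caller's scope is never consulted.

```python
from patsy import EvalEnvironment, PatsyError, dmatrix

# Hypothetical example data: one categorical column.
data = {"a": ["x", "y", "x", "y"]}


def build_shadowed():
    C = 1  # local variable that shadows Patsy's built-in C()
    # Default eval_env=0: names are looked up in this function's scope.
    return dmatrix("C(a)", data)


def build_clean():
    C = 1  # still defined here, but not visible to the formula
    # Explicit empty environment: only the data and Patsy's own
    # built-ins (C, I, Q, ...) can be resolved.
    return dmatrix("C(a)", data, eval_env=EvalEnvironment([{}]))


try:
    build_shadowed()
    shadow_error = None
except PatsyError as exc:
    shadow_error = exc

print("shadowed:", shadow_error)
print("clean columns:", build_clean().design_info.column_names)
```

The trade-off of a clean environment is that the formula can no longer reference helper functions or variables defined in the calling code, so renaming the variable is usually the simpler fix.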
error PatsyError: Error evaluating factor: NameError: no data named 'some_variable' found
cause The variable specified in the Patsy formula (e.g., 'some_variable') does not exist as a column in the DataFrame provided to `patsy.dmatrix` or `statsmodels.formula.api.ols`, or it contains invalid characters.
fix
Ensure that all variable names in your formula exactly match the column names in your DataFrame. If column names contain special characters (like '-', '+', ' '), rename them to be valid Python identifiers, or wrap them in Q() in the formula (e.g., Q('CFC-11')).
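As an illustration of the `Q()` workaround, the sketch below uses a hypothetical data dictionary whose key `CFC-11` is not a valid Python identifier, and also shows that a name missing from both the data and the calling scope surfaces as a `PatsyError`. The column names and values are made up for the example.

```python
from patsy import PatsyError, dmatrix

# Hypothetical data: "CFC-11" is not a valid Python identifier.
data = {"CFC-11": [1.0, 2.0, 3.0], "temp": [10.0, 11.0, 12.0]}

# Q() looks the quoted name up directly, so the hyphen is never
# parsed as a minus sign.
design = dmatrix('Q("CFC-11") + temp', data)
print(design.design_info.column_names)

# A name that exists neither in the data nor in the calling scope fails.
try:
    dmatrix("not_a_column", data)
    missing_error = None
except PatsyError as exc:
    missing_error = exc
print("missing:", missing_error)
```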
error PatsyError: Number of rows mismatch between data argument and column (statsmodels)
cause There is an inconsistency in the number of rows between the outcome variable and the predictor variables, often caused by missing values (`NaN`) in the data which Patsy handles by dropping entire rows by default, or an issue with the indexing of the input data.
fix
Inspect your DataFrame for missing values (df.isnull().sum()) in the columns used in the formula. Patsy drops rows with NaNs by default. Ensure your data is clean and aligned before passing it to Patsy, or explicitly handle missing values (e.g., imputation or dropping them manually before calling Patsy).
breaking Python 2.7 support was dropped in version 1.0.0. Projects still on Python 2 must use an older version of Patsy.
fix Upgrade to Python 3 or pin `patsy<1.0.0`.
gotcha Patsy automatically adds an intercept term and uses treatment coding for categorical variables (dropping one level). If you need to include all levels or omit the intercept, adjust your formula accordingly (e.g., in `y ~ x1 + C(a) - 1` the intercept is removed and `a` is coded with one column per level).
fix Use `- 1` in the formula to remove the intercept, or `C(variable, Treatment(reference=...))` to choose the coding explicitly. (`contr.treatment` is R syntax; Patsy's equivalent is `Treatment`.)
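The effect of the intercept and of the coding choice is easiest to see in the generated column names. This is a small sketch with hypothetical data (a two-level categorical `a` and a numeric `x1`), comparing the default, the `- 1` form, and an explicit `Treatment` reference level.

```python
from patsy import dmatrix

# Hypothetical data: a two-level categorical and a numeric column.
data = {"a": ["x", "y", "x", "y"], "x1": [0.5, 1.5, 2.5, 3.5]}

# Default: intercept plus treatment coding, reference level "x" dropped.
default_cols = dmatrix("a + x1", data).design_info.column_names

# "- 1" removes the intercept, so every level of `a` gets its own column.
no_intercept_cols = dmatrix("a + x1 - 1", data).design_info.column_names

# Treatment(reference=...) chooses which level is the baseline.
custom_ref = dmatrix("C(a, Treatment(reference='y')) + x1", data)

print(default_cols)
print(no_intercept_cols)
print(custom_ref.design_info.column_names)
```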
gotcha The `NA_action='drop'` is the default for `dmatrix` and `dmatrices`, which means rows containing any missing values will be silently dropped. This can lead to unexpected data loss if not anticipated.
fix Explicitly set `NA_action='raise'` to catch missing values or implement custom handling before calling Patsy functions.
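The difference between the two `NA_action` settings can be checked on a tiny input. The sketch below assumes hypothetical data with a single `NaN` in the predictor: the default drops that row, while `NA_action="raise"` turns it into an error.

```python
import numpy as np
from patsy import PatsyError, dmatrices

# Hypothetical data with one missing predictor value.
data = {"y": [1.0, 2.0, 3.0, 4.0], "x": [1.0, np.nan, 3.0, 4.0]}

# Default NA_action="drop": the row containing NaN disappears.
y_drop, X_drop = dmatrices("y ~ x", data)
print("rows in:", len(data["y"]), "rows out:", X_drop.shape[0])

# NA_action="raise": missing values become an explicit error instead.
try:
    dmatrices("y ~ x", data, NA_action="raise")
    na_error = None
except PatsyError as exc:
    na_error = exc
print("raise:", na_error)
```

Comparing the input length with the number of rows in the result, as in the first `print`, is a cheap guard when the default is kept.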
gotcha Operators like `**` in Patsy formulas are interpreted as interaction effects, not Python's power operator. Use `I()` (identity function) to force Python's interpretation (e.g., `I(x**2)`).
fix Wrap expressions meant for literal Python evaluation in `I()`, like `I(x**2)`.
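To make the two readings of `**` concrete, the following sketch (hypothetical numeric data, stored as NumPy arrays so that Python arithmetic works inside `I()`) compares the formula-level operator with the same expression wrapped in `I()`.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical numeric data; arrays are needed for x1**2 inside I().
data = {"x1": np.array([1.0, 2.0, 3.0]), "x2": np.array([4.0, 5.0, 7.0])}

# Formula-level **: main effects plus the x1:x2 interaction.
interaction_cols = dmatrix("(x1 + x2)**2", data).design_info.column_names

# On a single variable it adds nothing: x1:x1 collapses back to x1.
collapsed_cols = dmatrix("x1**2", data).design_info.column_names

# Inside I(), ** is ordinary Python exponentiation.
squared = np.asarray(dmatrix("I(x1**2)", data))

print(interaction_cols)
print(collapsed_cols)
print(squared[:, 1])  # column 0 is the intercept
```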
gotcha Patsy fixed compatibility issues with `numpy >= 2` in version 1.0.0. Older versions might not work correctly with newer NumPy.
fix Upgrade to `patsy>=1.0.0` if using `numpy>=2`.
gotcha Patsy fixed compatibility with Pandas 3's new `StringDtype` in version 1.0.2. Older versions may encounter issues with Pandas 3.
fix Upgrade to `patsy>=1.0.2` if using `pandas>=3`.
deprecated The project describes itself as 'no longer under active development' with respect to new features and names 'Formulaic' as its spiritual successor. New projects may be better served by Formulaic.
fix Consider 'Formulaic' for new projects; existing projects can keep using Patsy, which still receives maintenance releases.
| python | os / libc | status | wheel | install | import | disk |
|---|---|---|---|---|---|---|
| 3.10 | alpine (musl) | | wheel | - | 0.26s | 90.8M |
| 3.10 | alpine (musl) | | - | - | 0.25s | 90.8M |
| 3.10 | slim (glibc) | | wheel | 3.6s | 0.21s | 87M |
| 3.10 | slim (glibc) | | - | - | 0.19s | 87M |
| 3.11 | alpine (musl) | | wheel | - | 0.31s | 98.5M |
| 3.11 | alpine (musl) | | - | - | 0.37s | 98.5M |
| 3.11 | slim (glibc) | | wheel | 3.6s | 0.32s | 94M |
| 3.11 | slim (glibc) | | - | - | 0.32s | 94M |
| 3.12 | alpine (musl) | | wheel | - | 0.29s | 86.9M |
| 3.12 | alpine (musl) | | - | - | 0.31s | 86.9M |
| 3.12 | slim (glibc) | | wheel | 3.4s | 0.25s | 82M |
| 3.12 | slim (glibc) | | - | - | 0.29s | 82M |
| 3.13 | alpine (musl) | | wheel | - | 0.25s | 86.4M |
| 3.13 | alpine (musl) | | - | - | 0.29s | 86.3M |
| 3.13 | slim (glibc) | | wheel | 3.4s | 0.30s | 82M |
| 3.13 | slim (glibc) | | - | - | 0.29s | 82M |
| 3.9 | alpine (musl) | | wheel | - | 0.24s | 98.6M |
| 3.9 | alpine (musl) | | - | - | 0.22s | 98.6M |
| 3.9 | slim (glibc) | | wheel | 4.3s | 0.19s | 97M |
| 3.9 | slim (glibc) | | - | - | 0.21s | 97M |

This quickstart demonstrates how to use `patsy.dmatrices` to generate design matrices from a formula string and a dictionary-like data source. It automatically handles categorical variables and adds an intercept term.

import numpy as np
from patsy import dmatrices, demo_data

# Create example data
data = demo_data("a", "b", "x1", "x2", "y")

# Generate design matrices for a linear model
y, X = dmatrices("y ~ x1 + x2 + a", data=data)

print("Dependent variable (y):")
print(y)
print("\nIndependent variables (X):")
print(X)
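A natural next step after the quickstart is to inspect what was built and pass it to a numerical routine. The sketch below repeats the quickstart setup so it runs on its own, then reads the column names from `design_info` and fits ordinary least squares with NumPy. The least-squares step is an illustration of consuming the matrices, not something Patsy does itself.

```python
import numpy as np
from patsy import dmatrices, demo_data

data = demo_data("a", "b", "x1", "x2", "y")
y, X = dmatrices("y ~ x1 + x2 + a", data=data)

# design_info records how each column was derived from the formula.
print(X.design_info.column_names)

# The design matrices convert cleanly to plain NumPy arrays.
coefficients, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
for name, value in zip(X.design_info.column_names, coefficients.ravel()):
    print(f"{name}: {value:.3f}")
```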