Formulaic (Python for Wilkinson Formulas)
Formulaic is a high-performance Python implementation of Wilkinson formulas for statistical modeling. It simplifies feature engineering with an extensible formula parser and fast conversion of dataframes to model matrices. It supports several data input/output formats, including pandas DataFrames, NumPy arrays, SciPy sparse matrices, and Narwhals dataframes. The library is actively maintained; the current version is 1.2.1.
Warnings
- gotcha Following Wilkinson formula conventions, Formulaic adds an intercept term automatically unless it is explicitly removed (for example with `- 1`) and uses treatment coding for categorical variables by default. The resulting model matrix may therefore differ from what other statistical packages or manual feature engineering would produce.
- gotcha While `model_matrix` is a convenient shorthand, prefer `Formula('...').get_model_matrix()` when you need to inspect the compiled formula structure or reuse the generated `ModelSpec` so that the same transformations are applied across multiple datasets (e.g., training and test data).
- gotcha Formulaic is optimized for tabular data, most commonly a `pandas.DataFrame` as input. Other data structures (NumPy arrays, SciPy sparse matrices, Narwhals dataframes) are supported, but inconsistent input formats or unexpected data types within columns can cause errors.
Install
pip install formulaic
Imports
- Formula
from formulaic import Formula
- model_matrix
from formulaic import model_matrix
Quickstart
import pandas
from formulaic import Formula, model_matrix
df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})
print("Using Formula class (recommended for advanced use/reuse):")
f = Formula('y ~ x + z')
y_formula, X_formula = f.get_model_matrix(df)
print("Response (y) from Formula:\n", y_formula)
print("Design Matrix (X) from Formula:\n", X_formula)
print("\nUsing model_matrix shorthand:")
y_short, X_short = model_matrix('y ~ x + z', df)
print("Response (y) from model_matrix:\n", y_short)
print("Design Matrix (X) from model_matrix:\n", X_short)