statsmodels
statsmodels is a Python package offering a wide array of statistical models, hypothesis tests, and statistical data exploration tools. It provides classes and functions for estimating many different statistical models, including linear regression, generalized linear models, discrete choice models, and time series models. The current version is 0.14.6. The library follows a loose, time-based policy for its dependencies, typically raising the minimum supported versions every one and a half to two years. [2, 3, 5, 7]
Warnings
- breaking The `scikits` namespace was deprecated and eventually removed in versions prior to 0.5.0. Direct imports from `scikits.statsmodels` are no longer valid.
- breaking The signature of `model.predict` methods changed in versions prior to 0.5.0. It now explicitly requires the `params` argument (e.g., `model.predict(params, exog)`), rather than assuming the model has already been fit and omitting `params`.
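- deprecated The `statsmodels.tsa.arima_model.ARMA` and `statsmodels.tsa.arima_model.ARIMA` classes were deprecated in 0.12, where using them raised a `FutureWarning`, and were removed in 0.13, where instantiating them raises `NotImplementedError`. Use `statsmodels.tsa.arima.model.ARIMA` (note the `.` between `arima` and `model`) instead.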
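- gotcha When using the direct `statsmodels.api.OLS(y, X)` interface (without formulas), an intercept term (constant) is NOT automatically added to the `X` (exog) design matrix. This differs from some other statistical software and yields a regression through the origin when an intercept was expected. Add the constant explicitly, for example with `sm.add_constant(X)`. The formula interface (`statsmodels.formula.api`) does include an intercept by default.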
- breaking The pandas `Panel` object and `pandas.stats.ols` (among others) were deprecated and removed in pandas 0.20.1 and later. Code that relied on them for panel data or for running OLS directly from pandas must switch to another implementation, such as the OLS estimators in statsmodels.
- breaking Statsmodels 0.14.2 introduced compatibility with NumPy 2.0.0. While `statsmodels` itself may run on older NumPy versions, if you upgrade to NumPy 2.0, all other Python scientific stack dependencies (like SciPy and Pandas) *must also be NumPy 2.0 compatible* to avoid runtime issues. This release also increased the minimum Python version to 3.9 to match NumPy 2.0.
Install
pip install statsmodels
Imports
- statsmodels.api
import statsmodels.api as sm
- statsmodels.formula.api
import statsmodels.formula.api as smf
- Specific Submodule
from statsmodels.tsa.arima.model import ARIMA
Quickstart
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
# 1. Create a sample DataFrame in which y actually depends on x1 and x2
np.random.seed(42)
x1 = np.random.rand(100) * 10
x2 = np.random.randint(0, 2, 100)  # binary categorical variable
df = pd.DataFrame({
    'y': 10 + 2 * x1 + 3 * x2 + np.random.randn(100),
    'x1': x1,
    'x2': x2,
})
# 2. Fit OLS (Ordinary Least Squares) model using R-style formula
# 'y ~ x1 + C(x2)' means y is dependent on x1 and categorical x2
model = smf.ols('y ~ x1 + C(x2)', data=df)
results = model.fit()
# 3. Print the summary of the regression results
print(results.summary())
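Beyond the printed summary, the fitted results can also be inspected programmatically. The sketch below is self-contained and uses its own synthetic data, assumed here to follow y = 10 + 2*x1 + 3*x2 + noise; the `new_data` frame is hypothetical and stands in for whatever observations need predictions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x1": rng.random(100) * 10,
    "x2": rng.integers(0, 2, 100),
})
df["y"] = 10 + 2 * df["x1"] + 3 * df["x2"] + rng.normal(size=100)

results = smf.ols("y ~ x1 + C(x2)", data=df).fit()

# Coefficients come back as a pandas Series indexed by term name
print(results.params)
print(results.rsquared)    # coefficient of determination
print(results.conf_int())  # 95% confidence intervals by default

# Predict for new observations; the formula is re-applied to the new frame
new_data = pd.DataFrame({"x1": [1.0, 5.0], "x2": [0, 1]})
predictions = results.predict(new_data)
print(predictions)
```

Because the formula interface builds the design matrix itself, `new_data` only needs the raw columns named in the formula, not a constant or dummy-coded columns.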