MissingPy: Missing Data Imputation
MissingPy is a Python library providing tools for missing data imputation, offering an API consistent with scikit-learn. It primarily supports k-Nearest Neighbors (KNNImputer) and Random Forest-based (MissForest) imputation algorithms. The current version is 0.2.0, released in December 2018. Due to infrequent updates since its last release and limited recent activity on its GitHub repository, the library is considered to be in a maintenance state, with no active development or new releases anticipated.
Common errors
-
ModuleNotFoundError: No module named 'missingpy'
cause The `missingpy` package is not installed in the active Python environment.
fix Run `pip install missingpy` to install the library. If using a virtual environment or conda, ensure it is activated before installation.
-
ImportError: cannot import name '_check_weights' from 'sklearn.neighbors._base'
cause `missingpy` imports private helpers from `scikit-learn`'s internal `sklearn.neighbors` modules. Those internals have been renamed or removed in newer `scikit-learn` releases, so the import fails.
fix Apply the `scikit-learn` compatibility workaround so the old module path resolves: add `import sklearn.neighbors._base; import sys; sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base` before any `missingpy` imports. The alias cannot restore a helper that has been removed outright, so if the error persists, pin `scikit-learn` to a compatible older version (e.g., `scikit-learn==1.1.2`) and `scipy` (e.g., `scipy==1.9.1`).
-
ValueError: could not convert string to float: 'CategoryName'
cause You are attempting to use a `missingpy` imputer (like `MissForest`) on a DataFrame or array that contains non-numerical (e.g., string or object) categorical columns without prior encoding.
fix One-hot encode or label encode your categorical features into numerical representations before passing the data to `missingpy` imputers. For example, use `pandas.get_dummies()` or `sklearn.preprocessing.OneHotEncoder`.
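The encoding step for the last error can be sketched as follows. This is a minimal illustration, not part of `missingpy`: the helper name `encode_for_imputation` and the sample columns are hypothetical. It also handles one detail that is easy to miss: `pandas.get_dummies()` writes all zeros for a missing category, which would hide the gap from the imputer, so the sketch restores `NaN` in those rows.

```python
import numpy as np
import pandas as pd


def encode_for_imputation(df, categorical_cols):
    """One-hot encode the given columns, keeping missing categories as NaN.

    Hypothetical helper for illustration; it assumes no other column name
    starts with "<col>_" for any encoded column.
    """
    encoded = pd.get_dummies(df, columns=categorical_cols, dtype=float)
    for col in categorical_cols:
        dummy_cols = [c for c in encoded.columns if c.startswith(f"{col}_")]
        # get_dummies emits all zeros where the category was missing;
        # put NaN back so the imputer still sees these cells as missing.
        encoded.loc[df[col].isna(), dummy_cols] = np.nan
    return encoded


# Made-up sample data: one numeric column, one categorical column with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Oslo", "Rome", None, "Oslo"],
})

X = encode_for_imputation(df, ["city"])
print(X)
```

The resulting frame is fully numeric, so `X.to_numpy()` can be passed to an imputer. The `missingpy` README also describes a `cat_vars` argument to `fit` for identifying the encoded columns; check the installed version before relying on it.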
Warnings
- breaking MissingPy has severe compatibility issues with recent versions of `scikit-learn` because it relies on internal `sklearn.neighbors` modules that have since been renamed or removed (for example, `sklearn.neighbors.base` became `sklearn.neighbors._base`). Importing it against a recent release often raises `ImportError`.
- gotcha MissingPy's `MissForest` algorithm expects numerical input. If your dataset contains categorical variables, they must be explicitly one-hot encoded (dummy encoded) before being passed to the imputer; otherwise it raises an error such as `could not convert string to float`.
- gotcha The `missingpy` library is no longer actively maintained. The last PyPI release was in December 2018, and the GitHub repository shows minimal activity since. This means there will likely be no official updates for newer Python versions, `scikit-learn` compatibility, or bug fixes.
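Given these warnings, it can help to check the installed `scikit-learn` version before importing `missingpy`. The sketch below uses only the standard library. The `MAX_TESTED` threshold is an assumption taken from the `scikit-learn==1.1.2` pin mentioned on this page, not a limit documented by `missingpy`, and the helper names are hypothetical.

```python
import re
from importlib import metadata

# Assumption: 1.1.x is the newest scikit-learn line this page reports as working.
MAX_TESTED = (1, 1)


def major_minor(version_string):
    """Return (major, minor) from a version string such as '1.1.2'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def within_tested_range(version_string, max_tested=MAX_TESTED):
    """True if the version is not newer than the assumed working line."""
    return major_minor(version_string) <= max_tested


try:
    installed = metadata.version("scikit-learn")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("scikit-learn is not installed")
elif within_tested_range(installed):
    print(f"scikit-learn {installed}: within the range assumed to work")
else:
    print(f"scikit-learn {installed}: newer than the pin, expect import errors")
```

A check like this only reports the mismatch; resolving it still means pinning the dependency or applying the import workaround.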
Install
-
pip install missingpy
Imports
- KNNImputer
from missingpy import KNNImputer
- MissForest
from missingpy.missforest import MissForest
from missingpy import MissForest
- MissForest (with sklearn workaround)
import sklearn.neighbors._base
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
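The workaround can also be written as a guarded helper that registers the alias only when the old module path is missing, so it has no effect on an older `scikit-learn` that still ships `sklearn.neighbors.base`. This is a sketch under that assumption; `alias_module` is a hypothetical name, not something `missingpy` or `scikit-learn` provides.

```python
import importlib
import sys


def alias_module(old_name, new_name):
    """Make `old_name` importable by pointing it at `new_name` if needed."""
    if old_name in sys.modules:
        return sys.modules[old_name]
    try:
        # The old path still exists: nothing to patch.
        return importlib.import_module(old_name)
    except ImportError:
        # The old path is gone: register the new module under the old name.
        module = importlib.import_module(new_name)
        sys.modules[old_name] = module
        return module


try:
    alias_module("sklearn.neighbors.base", "sklearn.neighbors._base")
    from missingpy import MissForest
except ImportError as exc:
    # Raised if scikit-learn or missingpy is absent, or if a private helper
    # that missingpy needs no longer exists in the installed scikit-learn.
    print(f"missingpy could not be imported: {exc}")
```

The alias must be registered before the first `missingpy` import, because Python resolves the old path at import time.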
Quickstart
import numpy as np
# Workaround for scikit-learn compatibility
import sklearn.neighbors._base
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
nan = np.nan
X = np.array([
    [1, 2, nan],
    [3, 4, 3],
    [nan, 6, 5],
    [8, 8, 7]
])
imputer = MissForest(random_state=42)
X_imputed = imputer.fit_transform(X)
print("Original Data with NaNs:\n", X)
print("Imputed Data:\n", X_imputed)
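After running the quickstart, a quick sanity check on the result can be useful. The sketch below assumes an imputer should fill every gap while leaving observed values untouched. `check_imputation` is a hypothetical helper, and the column-mean fill is only a stand-in for an imputer's output so that the example runs without `missingpy` installed.

```python
import numpy as np


def check_imputation(X, X_imputed):
    """True if X_imputed has no NaNs and keeps every observed value of X."""
    X = np.asarray(X, dtype=float)
    X_imputed = np.asarray(X_imputed, dtype=float)
    if X.shape != X_imputed.shape:
        return False
    observed = ~np.isnan(X)
    no_gaps_left = not np.isnan(X_imputed).any()
    observed_unchanged = bool(np.allclose(X[observed], X_imputed[observed]))
    return no_gaps_left and observed_unchanged


X = np.array([
    [1, 2, np.nan],
    [3, 4, 3],
    [np.nan, 6, 5],
    [8, 8, 7],
], dtype=float)

# Stand-in for an imputer's output: fill each gap with its column mean.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

print(check_imputation(X, X_filled))  # True: gaps filled, observed values kept
print(check_imputation(X, X))         # False: the input still contains NaNs
```

With `missingpy` installed, the same check can be applied as `check_imputation(X, X_imputed)` using the quickstart's variables.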