MissingPy: Missing Data Imputation

0.2.0 · maintenance · verified Thu Apr 16

MissingPy is a Python library for missing data imputation with an API consistent with scikit-learn. It provides two imputers: `KNNImputer` (k-Nearest Neighbors) and `MissForest` (Random Forest-based). The current version is 0.2.0, released in December 2018. With no releases since then and little recent activity on its GitHub repository, the library is considered to be in a maintenance state, and no active development or new releases are anticipated.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart uses `MissForest` to impute missing values (represented by `np.nan`) in a NumPy array. It includes the workaround commonly required with newer `scikit-learn` releases: registering `sklearn.neighbors._base` under the old module name `sklearn.neighbors.base` before `missingpy` is imported. Categorical variables must be one-hot encoded before they are passed to the imputer.

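import sys

import numpy as np
import sklearn.neighbors._base

# Workaround for scikit-learn compatibility: register the private module
# under its old public name. This must run before missingpy is imported.
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from missingpy import MissForest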

nan = np.nan
X = np.array([
    [1, 2, nan],
    [3, 4, 3],
    [nan, 6, 5],
    [8, 8, 7]
])

imputer = MissForest(random_state=42)
X_imputed = imputer.fit_transform(X)

print("Original Data with NaNs:\n", X)
print("Imputed Data:\n", X_imputed)
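The module alias used in the quickstart can also be applied conditionally, so that it is registered only when the old import path is missing. The following is a minimal sketch under that assumption; `alias_module` is a hypothetical helper written for illustration and is not part of `missingpy` or `scikit-learn`.

```python
import importlib
import importlib.util
import sys


def alias_module(old_name, new_name):
    """Register new_name's module under old_name if old_name cannot be imported.

    Hypothetical helper (not part of missingpy or scikit-learn).
    Returns True if an alias was registered, False if old_name already imports.
    """
    try:
        importlib.import_module(old_name)
        return False
    except ImportError:
        sys.modules[old_name] = importlib.import_module(new_name)
        return True


# Apply the alias only when scikit-learn is installed.
if importlib.util.find_spec("sklearn") is not None:
    alias_module("sklearn.neighbors.base", "sklearn.neighbors._base")
    # Import missingpy only after this point, e.g.:
    # from missingpy import MissForest
```

Compared with assigning to `sys.modules` unconditionally, this leaves environments where the old path still imports untouched.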

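The one-hot encoding requirement can be illustrated with plain NumPy. The sketch below shows one plausible way to prepare a categorical column that itself contains missing values: every dummy column of a missing row is set to `np.nan`. The helper `one_hot_with_nan`, the sample data, and the NaN convention are assumptions made for illustration; the final commented call assumes that `MissForest` accepts the dummy-column indices through a `cat_vars` argument, which should be checked against the library's own documentation.

```python
import numpy as np


def one_hot_with_nan(codes, n_categories):
    """One-hot encode integer category codes, keeping missing rows as NaN.

    `codes` is a 1-D array of category indices (0..n_categories-1) in which
    np.nan marks a missing value. Hypothetical helper, not part of missingpy.
    """
    codes = np.asarray(codes, dtype=float)
    encoded = np.full((codes.shape[0], n_categories), np.nan)
    rows = np.flatnonzero(~np.isnan(codes))  # rows with an observed category
    encoded[rows] = 0.0
    encoded[rows, codes[rows].astype(int)] = 1.0
    return encoded


# Two numeric columns plus one categorical column with three levels.
numeric = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [8.0, 8.0]])
category_codes = np.array([0, 2, np.nan, 1])

dummies = one_hot_with_nan(category_codes, n_categories=3)
X = np.hstack([numeric, dummies])

# Column indices of the dummy columns within X.
cat_vars = np.arange(numeric.shape[1], X.shape[1])

print(X)
print(cat_vars)

# Assumed usage, not executed here:
# X_imputed = MissForest(random_state=42).fit_transform(X, cat_vars=cat_vars)
```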