Boruta-Py Feature Selection
Boruta-Py is a Python implementation of the Boruta all-relevant feature selection algorithm, originally developed in R. It helps identify all features that are relevant to a prediction task, rather than just a minimal optimal subset. The library follows a scikit-learn-like interface, allowing seamless integration into existing machine learning workflows. As of version 0.4.3, the library is actively maintained with periodic releases addressing community-reported issues.
Common errors
- TypeError: invalid key
  cause: Input features (X) or target (y) were passed as Pandas DataFrames instead of NumPy arrays.
  fix: Convert DataFrames to NumPy arrays using the `.values` attribute: `feat_selector.fit(X.values, y.values)`.
- AttributeError: module 'numpy' has no attribute 'int'. Did you mean: 'inf'?
  cause: Boruta-Py (especially older versions) uses `np.int`, an alias that was deprecated in NumPy 1.20 and removed in NumPy 1.24. The error therefore appears with NumPy 1.24+, often together with Python 3.11+.
  fix: Upgrade Boruta-Py to version 0.4.0 or higher. If the problem persists, temporarily pin NumPy below 1.24 (e.g., `pip install "numpy<1.24"`).
- KeyError: 'max_depth'
  cause: `n_estimators='auto'` was used in `BorutaPy` with a non-tree-based estimator (e.g., `LogisticRegression`) that has no `max_depth` parameter, which `BorutaPy` tries to access internally.
  fix: Ensure the `estimator` passed to `BorutaPy` is a tree-based model (e.g., `RandomForestClassifier`, `ExtraTreesClassifier`). Passing an integer for `n_estimators` instead of `'auto'` avoids the `max_depth` lookup, but the estimator must still accept an `n_estimators` parameter and expose `feature_importances_`.
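The DataFrame-to-array conversion described for the first error can be sketched as follows. This is a minimal illustration, not part of Boruta-Py: the helper name `to_numpy_xy` and the toy data are hypothetical, and `np.asarray` is used as an equivalent of `.values`.

```python
import numpy as np
import pandas as pd


def to_numpy_xy(X, y):
    """Return (X, y) as NumPy arrays, with y flattened to 1-D."""
    # np.asarray accepts DataFrames, Series, lists, and existing arrays.
    return np.asarray(X), np.asarray(y).ravel()


# Hypothetical toy data, for illustration only.
X_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 0.1, 0.9]})
y_sr = pd.Series([0, 1, 0], name="label")

X_np, y_np = to_numpy_xy(X_df, y_sr)
print(type(X_np).__name__, X_np.shape, y_np.shape)

# The arrays, not the DataFrames, are then passed to the selector:
# feat_selector.fit(X_np, y_np)
```

Keeping the original DataFrame around is useful, since its column names are needed later to label the selected features.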
Warnings
- gotcha BorutaPy expects NumPy arrays for X and y inputs. Passing Pandas DataFrames directly, without converting them via `.values`, can raise `TypeError: invalid key`.
- breaking Older versions of Boruta-Py (prior to 0.4.x) may have compatibility issues with newer Python (e.g., 3.11+) and NumPy versions (e.g., 1.24+), specifically raising `AttributeError: module 'numpy' has no attribute 'int'`.
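- gotcha Using `n_estimators='auto'` in `BorutaPy` with an underlying estimator that has no `max_depth` parameter (e.g., `LogisticRegression`, `SVC`) raises `KeyError: 'max_depth'`. Estimators that lack an `n_estimators` parameter or a `feature_importances_` attribute can still fail with a `ValueError` when an integer is passed.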
- gotcha Boruta-Py, especially on large datasets or with many features, can be computationally intensive and slow due to its iterative nature and reliance on ensemble methods.
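The estimator-compatibility warning above can be checked before fitting. The sketch below is a hypothetical pre-flight helper (`missing_boruta_params` is not a Boruta-Py function). It assumes, based on the errors listed above, that BorutaPy needs an `n_estimators` parameter on the estimator and additionally reads `max_depth` when `n_estimators='auto'`. It relies only on scikit-learn's `get_params()`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def missing_boruta_params(estimator, n_estimators="auto"):
    """List parameters the estimator lacks for the chosen n_estimators setting.

    Assumption: BorutaPy needs `n_estimators` on the estimator and, when
    n_estimators='auto', also reads `max_depth`.
    """
    available = estimator.get_params()
    required = ["n_estimators"]
    if n_estimators == "auto":
        required.append("max_depth")
    return [name for name in required if name not in available]


forest_gaps = missing_boruta_params(RandomForestClassifier(max_depth=5))
logreg_gaps = missing_boruta_params(LogisticRegression())
print(forest_gaps)  # []
print(logreg_gaps)  # ['n_estimators', 'max_depth']
```

An empty list does not guarantee the fit will succeed; it only rules out the parameter lookups named in the warnings.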
Install
- pip install Boruta
Imports
- BorutaPy
from boruta import BorutaPy
Quickstart
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from boruta import BorutaPy
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Initialize a Random Forest classifier (estimator for Boruta)
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)
# Initialize BorutaPy
# n_estimators='auto' automatically determines the number of trees
# verbose controls output: 0=no, 1=some, 2=detailed
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=42)
# Fit Boruta on training data
# Note: X and y must be numpy arrays
feat_selector.fit(X, y)
# Print results
print("\n----- BorutaPy Results -----")
print(f"Selected features: {np.where(feat_selector.support_)[0]}")
print(f"Feature ranking: {feat_selector.ranking_}")
# Transform the dataset to include only selected features
X_filtered = feat_selector.transform(X)
print(f"Shape of original X: {X.shape}")
print(f"Shape of filtered X: {X_filtered.shape}")
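The quickstart prints column indices, which are hard to read on their own. A minimal sketch of mapping the boolean mask and ranking back to feature names follows. The helper `summarize_selection` and the toy inputs are hypothetical stand-ins for `feat_selector.support_` and `feat_selector.ranking_`, so the block runs without fitting a model.

```python
import numpy as np


def summarize_selection(feature_names, support, ranking):
    """Group feature names by selection status and order them by rank."""
    names = np.asarray(feature_names)
    support = np.asarray(support, dtype=bool)
    ranking = np.asarray(ranking)
    # Stable sort keeps the original column order among equal ranks.
    order = np.argsort(ranking, kind="stable")
    return {
        "selected": names[support].tolist(),
        "rejected": names[~support].tolist(),
        "by_rank": [(str(names[i]), int(ranking[i])) for i in order],
    }


# Hypothetical stand-ins for feat_selector.support_ and feat_selector.ranking_.
toy_names = ["f0", "f1", "f2", "f3"]
toy_support = [True, False, True, False]
toy_ranking = [1, 3, 1, 2]

summary = summarize_selection(toy_names, toy_support, toy_ranking)
print(summary["selected"])  # ['f0', 'f2']
print(summary["by_rank"])   # [('f0', 1), ('f2', 1), ('f3', 2), ('f1', 3)]
```

With the quickstart objects, the call would be `summarize_selection(data.feature_names, feat_selector.support_, feat_selector.ranking_)`.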