Boruta-Py Feature Selection

0.4.3 · active · verified Thu Apr 16

Boruta-Py is a Python implementation of the Boruta all-relevant feature selection algorithm, originally developed in R. It helps identify all features that are relevant to a prediction task, rather than just a minimal optimal subset. The library follows a scikit-learn-like interface, allowing seamless integration into existing machine learning workflows. As of version 0.4.3, the library is actively maintained with periodic releases addressing community-reported issues.
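To make "all-relevant" more concrete, the following is a minimal sketch of one round of the shadow-feature comparison that the Boruta algorithm is built on. It is not BorutaPy's implementation: the toy data, the variable names (`X_shadow`, `hits`), and the forest settings are assumptions chosen for illustration, and the real algorithm repeats this round many times and applies a statistical test to the accumulated hit counts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data (assumption): only column 0 carries signal, columns 1-4 are noise
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Shadow features: each column shuffled independently, breaking any link to y
X_shadow = np.column_stack([rng.permutation(col) for col in X.T])
X_aug = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_aug, y)

n_features = X.shape[1]
real_imp = forest.feature_importances_[:n_features]
shadow_max = forest.feature_importances_[n_features:].max()

# A "hit" is a real feature whose importance beats the best shadow feature
hits = real_imp > shadow_max
print(hits)
```

A feature that keeps scoring hits across rounds is a candidate for being relevant, even if it is redundant with another feature. That is the difference from methods that look for a minimal optimal subset.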

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to use `BorutaPy` for feature selection with a `RandomForestClassifier`. It loads the breast cancer dataset, initializes a classifier and a `BorutaPy` selector, fits the selector to the data, and then prints the indices of the selected features, the feature ranking, and the shapes of the original and filtered datasets. `load_breast_cancer()` already returns NumPy arrays, which is the input format `BorutaPy` requires, so no conversion is needed here.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from boruta import BorutaPy

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Initialize a Random Forest classifier (estimator for Boruta)
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)

# Initialize BorutaPy
# n_estimators='auto' automatically determines the number of trees
# verbose controls output: 0=no, 1=some, 2=detailed
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=42)

# Fit Boruta on the full dataset (this example does no train/test split)
# Note: X and y must be numpy arrays
feat_selector.fit(X, y)

# Print results
print("\n----- BorutaPy Results -----")
print(f"Selected feature indices: {np.where(feat_selector.support_)[0]}")
print(f"Feature ranking: {feat_selector.ranking_}")

# Transform the dataset to include only selected features
X_filtered = feat_selector.transform(X)
print(f"Shape of original X: {X.shape}")
print(f"Shape of filtered X: {X_filtered.shape}")
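
The quickstart prints column indices, which are hard to read on their own. The sketch below maps a fitted selector's boolean masks back to feature names. The helper `summarize_selection` is hypothetical and not part of BorutaPy. It assumes the selector exposes `support_` (confirmed features) and, optionally, `support_weak_` (tentative features). The `SimpleNamespace` object is only a stand-in so the example runs without refitting.

```python
import numpy as np
from types import SimpleNamespace


def summarize_selection(selector, feature_names):
    """Group feature names by the selector's boolean masks (hypothetical helper)."""
    names = np.asarray(feature_names)
    support = np.asarray(selector.support_, dtype=bool)
    # support_weak_ marks tentative features; fall back to all-False if absent
    weak = np.asarray(
        getattr(selector, "support_weak_", np.zeros_like(support)), dtype=bool
    )
    return {
        "confirmed": names[support].tolist(),
        "tentative": names[weak].tolist(),
        "rejected": names[~(support | weak)].tolist(),
    }


# Stand-in for a fitted selector (assumption: masks shaped like BorutaPy's)
demo = SimpleNamespace(
    support_=np.array([True, False, True, False]),
    support_weak_=np.array([False, True, False, False]),
)
summary = summarize_selection(demo, ["a", "b", "c", "d"])
print(summary)
```

With the quickstart objects, the equivalent call would be `summarize_selection(feat_selector, data.feature_names)`.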
