{"id":7052,"library":"boruta","title":"Boruta-Py Feature Selection","description":"Boruta-Py is a Python implementation of the Boruta all-relevant feature selection algorithm, originally developed in R. It helps identify all features that are relevant to a prediction task, rather than just a minimal optimal subset. The library follows a scikit-learn-like interface, allowing seamless integration into existing machine learning workflows. As of version 0.4.3, the library is actively maintained with periodic releases addressing community-reported issues.","status":"active","version":"0.4.3","language":"en","source_language":"en","source_url":"https://github.com/danielhomola/boruta_py","tags":["feature selection","machine learning","scikit-learn","random forest","ensemble methods","data preprocessing"],"install":[{"cmd":"pip install Boruta","lang":"bash","label":"Install via pip"}],"dependencies":[{"reason":"Fundamental for numerical operations and array handling.","package":"numpy"},{"reason":"Used for scientific computing and statistical tests within the algorithm.","package":"scipy"},{"reason":"Provides the base estimator interface and utility functions (e.g., `RandomForestClassifier`).","package":"scikit-learn"}],"imports":[{"symbol":"BorutaPy","correct":"from boruta import BorutaPy"}],"quickstart":{"code":"import numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.datasets import load_breast_cancer\nfrom boruta import BorutaPy\n\n# Load dataset\ndata = load_breast_cancer()\nX = data.data\ny = data.target\n\n# Initialize a Random Forest classifier (estimator for Boruta)\nforest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)\n\n# Initialize BorutaPy\n# n_estimators='auto' automatically determines the number of trees\n# verbose controls output: 0=no, 1=some, 2=detailed\nfeat_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=42)\n\n# Fit Boruta on training data\n# Note: X and 
y must be numpy arrays\nfeat_selector.fit(X, y)\n\n# Print results\nprint(\"\\n----- BorutaPy Results -----\")\nprint(f\"Selected features: {np.where(feat_selector.support_)[0]}\")\nprint(f\"Feature ranking: {feat_selector.ranking_}\")\n\n# Transform the dataset to include only selected features\nX_filtered = feat_selector.transform(X)\nprint(f\"Shape of original X: {X.shape}\")\nprint(f\"Shape of filtered X: {X_filtered.shape}\")","lang":"python","description":"This quickstart demonstrates how to use `BorutaPy` for feature selection with a `RandomForestClassifier`. It loads the breast cancer dataset, initializes a classifier and a `BorutaPy` selector, fits the selector to the data, and then prints the indices of the selected features, the feature ranking, and the shapes of the original and filtered arrays. `load_breast_cancer()` already returns NumPy arrays, so no conversion is needed before calling `fit()`."},"warnings":[{"fix":"Always convert Pandas DataFrames to NumPy arrays before passing them to `BorutaPy.fit()` or `transform()`: `X.values`, `y.values`.","message":"BorutaPy expects NumPy arrays for X and y inputs. Passing Pandas DataFrames directly without converting to `.values` can lead to errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade to Boruta-Py version 0.4.0 or newer. If issues persist, consider downgrading NumPy to <1.24 (e.g., 1.23.1) or Python to <3.11, though this is less ideal than upgrading Boruta-Py itself.","message":"Older versions of Boruta-Py (prior to 0.4.x) may have compatibility issues with newer Python (e.g., 3.11+) and NumPy versions (e.g., 1.24+), specifically raising `AttributeError: module 'numpy' has no attribute 'int'`.","severity":"breaking","affected_versions":"<0.4.0"},{"fix":"Ensure your base estimator is tree-based (e.g., `RandomForestClassifier`, `ExtraTreesClassifier`) when using `n_estimators='auto'`. 
For non-tree estimators, explicitly set `n_estimators` to an integer value in `BorutaPy` and consider whether Boruta is the appropriate method.","message":"Using `n_estimators='auto'` in `BorutaPy` with an underlying estimator that does not support an `n_estimators` parameter (e.g., `LogisticRegression`, `SVC`) will cause a `KeyError` or `ValueError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Start with a smaller subset of data, consider reducing `max_iter`, or tune the base estimator's parameters (e.g., `max_depth` between 3 and 7 for `RandomForestClassifier`, as recommended by the author) for faster execution during prototyping.","message":"Boruta-Py can be slow and computationally intensive, especially on large datasets or with many features, because it is iterative and refits an ensemble model on every iteration.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Convert DataFrames to NumPy arrays using the `.values` attribute: `feat_selector.fit(X.values, y.values)`.","cause":"Input features (X) or target (y) were passed as Pandas objects (DataFrame or Series) instead of NumPy arrays.","error":"TypeError: invalid key"},{"fix":"Upgrade Boruta-Py to version 0.4.0 or higher. If the problem persists, temporarily downgrade NumPy to a compatible version (e.g., `pip install numpy==1.23.1`).","cause":"This error typically occurs when Boruta-Py (especially older versions) uses `np.int`, which was deprecated in NumPy 1.20 and removed in NumPy 1.24. It often appears with Python 3.11+ and NumPy 1.24+.","error":"AttributeError: module 'numpy' has no attribute 'int'. Did you mean: 'inf'?"},{"fix":"Ensure the `estimator` passed to `BorutaPy` is a tree-based model (e.g., `RandomForestClassifier`, `ExtraTreesClassifier`). 
If using a non-tree model, provide an integer value for `n_estimators` instead of `'auto'`.","cause":"This error can occur when `n_estimators='auto'` is used in `BorutaPy` with a non-tree-based estimator (e.g., `LogisticRegression`) that does not have a `max_depth` parameter, which `BorutaPy` tries to access internally.","error":"KeyError: 'max_depth'"}]}