XGBoost
XGBoost (eXtreme Gradient Boosting) is an optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements machine learning algorithms under the Gradient Boosting framework, excelling in tasks like classification, regression, and ranking. It currently stands at version 3.2.0, with frequent patch releases and major updates typically occurring every 6-12 months.
pip install xgboost

Common errors
error ModuleNotFoundError: No module named 'xgboost'
cause The `xgboost` library is not installed in the current Python environment.
fix
pip install xgboost
error ValueError: Invalid DMatrix: xgboost.DMatrix is required for input data.
cause The `xgboost.train` function or some core XGBoost C API methods require input data to be an `xgboost.DMatrix` object, but a raw data structure like a NumPy array or Pandas DataFrame was provided.
fix
import xgboost as xgb
import numpy as np
# Assuming X and y are NumPy arrays or Pandas DataFrames
dmatrix_data = xgb.DMatrix(X, label=y)
# Use dmatrix_data for xgb.train or other DMatrix-requiring functions
error ValueError: feature_names mismatch
cause The feature names or their order in the input data provided for prediction or evaluation do not match those used when the XGBoost model was trained.
fix
import pandas as pd
# Assumes `model` was trained on the DataFrame X_train, and that
# `some_new_data` holds the raw rows you want to score
X_predict = pd.DataFrame(some_new_data, columns=X_train.columns)
model.predict(X_predict)
error XGBoostError: Unknown parameter: <parameter_name>
cause An unrecognized or misspelled parameter was passed to an XGBoost model constructor (e.g., `XGBClassifier`) or the `xgb.train` function.
fix
# Check the official XGBoost documentation for available parameters
# Correct a common typo (e.g., 'sub_sample' instead of 'subsample'):
# Incorrect:
# model = xgb.XGBClassifier(sub_sample=0.8)
# Correct:
model = xgb.XGBClassifier(subsample=0.8)
# Note: Parameters for xgb.train are passed in a dictionary, while for XGBClassifier/XGBRegressor they are keyword arguments.
Warnings
breaking In XGBoost 2.0, several parameters related to GPU usage (`gpu_id`, the `gpu_hist` tree method, `gpu_predictor`, `cpu_predictor`, `gpu_coord_descent`) were replaced by a single `device` parameter. The `hist` tree method also became the default. The `predictor` parameter was removed.
fix Migrate deprecated GPU-related parameters to the new `device` parameter (e.g., `device='cuda'` for GPU). Review and update code that explicitly set `tree_method` or relied on the old `predictor` parameter.
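A minimal before/after sketch of the migration (parameter values are illustrative):

import xgboost as xgb
# XGBoost 1.x style (now removed):
# model = xgb.XGBRegressor(tree_method='gpu_hist', gpu_id=0)
# XGBoost 2.0+ style: select the hardware with the single `device` parameter
model = xgb.XGBRegressor(tree_method='hist', device='cuda')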
breaking XGBoost 2.0 introduced breaking changes in prediction functions, renaming all data parameters to `X` for better scikit-learn estimator interface compliance. It also dropped the generation of pseudo-feature names for `np.ndarray` inputs to `DMatrix` and changed the default evaluation metric for `binary:logitraw` to `logloss`.
fix Update calls to prediction functions to use `X` for data arguments. If your code relied on auto-generated feature names from NumPy arrays, you might need to provide them explicitly. Verify model evaluation logic, especially for binary classification with raw logistic output.
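If your code relied on the old auto-generated names for NumPy inputs, a hedged sketch of supplying names explicitly (the names below are placeholders):

import numpy as np
import xgboost as xgb
X = np.random.rand(10, 3)
y = np.random.rand(10)
# Feature names are no longer auto-generated for ndarray inputs; pass them yourself
dtrain = xgb.DMatrix(X, label=y, feature_names=['f0', 'f1', 'f2'])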
breaking XGBoost 3.0 removed the deprecated `DeviceQuantileDMatrix`, dropped support for saving models in certain deprecated formats (though loading old models is still supported), and removed support for legacy (blocking) CUDA streams. It also now requires CUDA 12.0 or later.
fix Remove usages of `DeviceQuantileDMatrix`. If you have code saving models in old formats, consider updating to current supported formats. Ensure your CUDA environment meets the minimum requirement of 12.0+ if using GPU acceleration.
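A sketch of the replacement, assuming you previously constructed a `DeviceQuantileDMatrix` for GPU `hist` training:

import numpy as np
import xgboost as xgb
X = np.random.rand(100, 4)
y = np.random.rand(100)
# Old (removed in 3.0): dtrain = xgb.DeviceQuantileDMatrix(X, label=y)
dtrain = xgb.QuantileDMatrix(X, label=y)  # current replacement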
deprecated The `use_label_encoder` parameter in `XGBClassifier` was deprecated and is now effectively ignored (recent releases warn if it is passed). The internal label encoder has been removed, so class labels must already be consecutive integers starting at 0.
fix Remove `use_label_encoder=True/False` from `XGBClassifier` initializations and encode labels yourself where needed (e.g., with scikit-learn's `LabelEncoder`). In older code, setting `use_label_encoder=False` was a common workaround to silence the warning.
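A short sketch of the current pattern; string labels are encoded up front rather than by the wrapper:

import numpy as np
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

X = np.random.rand(6, 3)
y_raw = np.array(['cat', 'dog', 'cat', 'bird', 'dog', 'bird'])
y = LabelEncoder().fit_transform(y_raw)  # consecutive integers 0..n_classes-1
model = xgb.XGBClassifier()              # no use_label_encoder argument
model.fit(X, y)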
gotcha Blindly using default hyperparameters without tuning for your specific dataset is a common mistake, leading to sub-optimal model performance or overfitting/underfitting. XGBoost offers a wide range of parameters that need careful tuning.
fix Always perform hyperparameter tuning using techniques like Grid Search, Random Search, or more advanced optimization frameworks (e.g., Optuna) tailored to your dataset's characteristics. Focus on `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, and `gamma`.
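A minimal sketch using scikit-learn's RandomizedSearchCV; the search ranges below are illustrative starting points, not recommendations:

from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

X, y = load_iris(return_X_y=True)
param_dist = {
    'learning_rate': uniform(0.01, 0.3),
    'max_depth': randint(3, 10),
    'min_child_weight': randint(1, 10),
    'subsample': uniform(0.5, 0.5),
    'colsample_bytree': uniform(0.5, 0.5),
    'gamma': uniform(0, 5),
}
search = RandomizedSearchCV(xgb.XGBClassifier(), param_dist, n_iter=20, cv=3)
search.fit(X, y)
print(search.best_params_)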
gotcha Ignoring class imbalance in your dataset can lead to models that perform poorly on the minority class, which is often the class of most interest. XGBoost provides mechanisms to address this.
fix For binary classification, use the `scale_pos_weight` parameter. For multi-class problems, pass per-sample weights to `fit` via its `sample_weight` argument (XGBoost's scikit-learn API has no `class_weight` parameter); scikit-learn's `compute_sample_weight('balanced', y)` is a convenient way to build them. Supplement these with sampling techniques like SMOTE if necessary.
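A hedged sketch on synthetic data, covering both mechanisms:

import numpy as np
import xgboost as xgb
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X_train = rng.random((100, 4))
y_train = np.array([0] * 90 + [1] * 10)  # illustrative 9:1 imbalance

# Binary: weight the positive class by the negative/positive ratio
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=ratio)
model.fit(X_train, y_train)

# Multi-class (or as an alternative): balanced per-sample weights at fit time
weights = compute_sample_weight('balanced', y_train)
model.fit(X_train, y_train, sample_weight=weights)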
gotcha When using XGBoost with common machine learning workflows, libraries like scikit-learn are frequently used for tasks such as data splitting (`train_test_split`), preprocessing, and evaluation metrics. A `ModuleNotFoundError` for such libraries indicates that they are not installed in the current Python environment.
fix Ensure all necessary external libraries, such as scikit-learn, are installed in your Python environment. You can typically install scikit-learn using `pip install scikit-learn` or by including it in your project's `requirements.txt` file.
gotcha Installing XGBoost on certain Linux distributions, especially minimal ones like Alpine, requires build tools like 'cmake' and a C++ compiler. Without these, the installation process will fail with a 'FileNotFoundError' for 'cmake' or similar compilation errors.
fix Ensure 'cmake' and a C++ compiler (e.g., g++ or clang) are installed on your system before attempting to install XGBoost. For Alpine Linux, this typically involves `apk add cmake g++`.
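For example, on Alpine (assumes the standard apk repositories; building from source can still take several minutes):

apk add cmake g++
pip install xgboost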
Install compatibility (last tested 2026-05-12)
python  os / libc      status       wheel  install  import  disk
3.9     alpine (musl)  build_error  -      -        -       -
3.9     alpine (musl)  -            -      -        -       -
3.9     slim (glibc)   -            wheel  20.6s    0.81s   955M
3.9     slim (glibc)   -            -      -        0.69s   944M
3.10    alpine (musl)  build_error  -      -        -       -
3.10    alpine (musl)  -            -      -        -       -
3.10    slim (glibc)   -            wheel  17.7s    0.95s   843M
3.10    slim (glibc)   -            -      -        0.71s   833M
3.11    alpine (musl)  build_error  -      -        -       -
3.11    alpine (musl)  -            -      -        -       -
3.11    slim (glibc)   -            wheel  19.1s    1.42s   857M
3.11    slim (glibc)   -            -      -        1.16s   846M
3.12    alpine (musl)  build_error  -      -        -       -
3.12    alpine (musl)  -            -      -        -       -
3.12    slim (glibc)   -            wheel  18.5s    1.39s   843M
3.12    slim (glibc)   -            -      -        1.25s   832M
3.13    alpine (musl)  build_error  -      -        -       -
3.13    alpine (musl)  -            -      -        -       -
3.13    slim (glibc)   -            wheel  16.6s    1.29s   842M
3.13    slim (glibc)   -            -      -        1.23s   831M
Imports
- xgboost
import xgboost as xgb
- XGBClassifier
from xgboost import XGBClassifier
- XGBRegressor
from xgboost import XGBRegressor
- DMatrix
wrong:
from xgboost import DMatrix
correct:
import xgboost as xgb
dtrain = xgb.DMatrix(data, label=label)
Quickstart (last tested 2026-04-24)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target
# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. Initialize and train an XGBoost Classifier
model = xgb.XGBClassifier(
    objective='multi:softprob',        # multi-class classification with probability output
    num_class=len(iris.target_names),  # also inferred from y at fit time; shown for clarity
    eval_metric='mlogloss',
)
# use_label_encoder is deprecated and intentionally omitted here
model.fit(X_train, y_train)
# 4. Make predictions
y_pred = model.predict(X_test)
# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Native API example (alternative to Scikit-learn API)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
'objective': 'multi:softprob',
'num_class': len(iris.target_names),
'eval_metric': 'mlogloss'
}
num_round = 100
bst = xgb.train(params, dtrain, num_round, evals=[(dtest, 'test')])
preds_native = bst.predict(dtest)
# For multi:softprob, predict returns per-class probabilities; take argmax for class labels
preds_class_native = preds_native.argmax(axis=1)
accuracy_native = accuracy_score(y_test, preds_class_native)
print(f"Native API Accuracy: {accuracy_native:.2f}")