XGBoost
XGBoost (eXtreme Gradient Boosting) is an optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements machine learning algorithms under the Gradient Boosting framework, excelling in tasks like classification, regression, and ranking. It currently stands at version 3.2.0, with frequent patch releases and major updates typically occurring every 6-12 months.
Warnings
- breaking In XGBoost 2.0, several parameters related to GPU usage (`gpu_id`, `gpu_hist`, `gpu_predictor`, `cpu_predictor`, `gpu_coord_descent`) were replaced by a single `device` parameter. The `hist` tree method also became the default. The `predictor` parameter was removed.
- breaking XGBoost 2.0 introduced breaking changes in prediction functions, renaming all data parameters to `X` for better scikit-learn estimator interface compliance. It also dropped the generation of pseudo-feature names for `np.ndarray` inputs to `DMatrix` and changed the default evaluation metric for `binary:logitraw` to `logloss`.
- breaking XGBoost 3.0 removed the deprecated `DeviceQuantileDMatrix`, dropped support for saving models in certain deprecated formats (though loading old models is still supported), and removed support for legacy (blocking) CUDA streams. It also now requires CUDA 12.0 or later.
- deprecated The `use_label_encoder` parameter of `XGBClassifier` was deprecated in 1.6 and is ignored by recent releases (older versions may emit warnings or errors when it is set). Modern XGBoost no longer encodes labels for you: class labels must already be consecutive integers starting at 0, so encode string or non-contiguous labels (e.g. with scikit-learn's `LabelEncoder`) before calling `fit`.
- gotcha Relying on default hyperparameters without tuning them for your dataset is a common mistake and often yields sub-optimal performance, overfitting, or underfitting. Parameters such as `learning_rate`, `max_depth`, `n_estimators`, `subsample`, and the regularization terms `reg_alpha`/`reg_lambda` usually need dataset-specific tuning.
- gotcha Ignoring class imbalance can produce models that perform poorly on the minority class, which is often the class of most interest. XGBoost provides mechanisms to address this, such as `scale_pos_weight` for binary problems and per-instance sample weights.
Install
-
pip install xgboost
Imports
- xgboost
import xgboost as xgb
- XGBClassifier
from xgboost import XGBClassifier
- XGBRegressor
from xgboost import XGBRegressor
- DMatrix
import xgboost as xgb
dtrain = xgb.DMatrix(data, label=label)
Quickstart
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target
# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. Initialize and train an XGBoost Classifier
model = xgb.XGBClassifier(
    objective='multi:softprob',  # Multi-class classification with probability output
    eval_metric='mlogloss'       # num_class is inferred from y by the sklearn wrapper
)
model.fit(X_train, y_train)
# 4. Make predictions
y_pred = model.predict(X_test)
# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Native API example (alternative to Scikit-learn API)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
'objective': 'multi:softprob',
'num_class': len(iris.target_names),
'eval_metric': 'mlogloss'
}
num_round = 100
bst = xgb.train(params, dtrain, num_round, evals=[(dtest, 'test')])
preds_native = bst.predict(dtest)
# For multi:softprob, preds_native is an (n_samples, num_class) array of
# probabilities; take the argmax over classes to recover class labels
preds_class_native = preds_native.argmax(axis=1)
accuracy_native = accuracy_score(y_test, preds_class_native)
print(f"Native API Accuracy: {accuracy_native:.2f}")