XGBoost (CPU Version)
XGBoost is an optimized distributed gradient boosting library designed for speed and performance. The `xgboost-cpu` package provides the same core XGBoost library compiled without GPU support, yielding a much smaller wheel for CPU-only deployments. It is published alongside each XGBoost release and shares the core library's version number.
Common errors
- `ModuleNotFoundError: No module named 'xgboost'`
  cause: The `xgboost` library is not installed in the currently active Python environment, or the environment is not activated.
  fix: Install the package with `pip install xgboost-cpu`. If using `conda`, use `conda install -c conda-forge xgboost`.
- `ValueError: feature_names mismatch`
  cause: A pre-trained model is asked to predict on a DataFrame whose feature names or column order differ from the data used to train it.
  fix: Verify that the feature names and their order in your prediction data match those used during training. Inspect the expected names with `model.get_booster().feature_names`. If using `DMatrix`, ensure `feature_names` is passed correctly at creation.
- `XGBoostError: DMatrix is not allowed to be empty`
  cause: A `DMatrix` was created, or a model trained, from an empty dataset (0 rows or 0 columns, or entirely NaN values).
  fix: Check that your features and labels are non-empty, not entirely null, and have valid dimensions before passing them to `DMatrix` or the model's `fit` method.
- `TypeError: 'str' object cannot be interpreted as an integer`
  cause: A parameter that expects an integer (e.g., `n_jobs`, `num_boost_round`, `random_state`) was passed a string, often from an environment variable or configuration file read without type conversion.
  fix: Convert the value to an integer, e.g. `n_jobs=int(os.environ.get('XGB_N_JOBS', '-1'))`, and write `n_estimators=100` rather than `n_estimators='100'`.
Warnings
- breaking The deprecated `ntree_limit` argument to `predict` was removed in XGBoost 2.0. Use `iteration_range` (e.g. `iteration_range=(0, n)`) to limit the trees used for prediction.
- breaking XGBoost 2.0 introduced significant changes to model serialization (the portable JSON/UBJSON formats, e.g. `model.json`, versus the older binary format) and to the Python package structure. Models saved in the older binary format may not load cleanly in XGBoost 2.x; re-save them in JSON/UBJSON with a current version.
- gotcha The `xgboost-cpu` package installs under the same `xgboost` module name as the regular package; do not install both in the same environment, as they will shadow each other. Always check `xgboost.__version__` after installation to confirm which library version is active.
- deprecated The `gpu_id` parameter for specifying GPU devices has been deprecated and replaced by the more general `device` parameter (e.g., `device='cuda:0'`, `device='cpu'`).
- gotcha The native training API (`xgboost.train`, `xgboost.cv`) operates on `xgboost.DMatrix` objects, not raw NumPy arrays or Pandas DataFrames. For very large datasets, constructing a `DMatrix` once and reusing it is also more performant and memory-efficient than repeatedly converting raw data.
Install
-
pip install xgboost-cpu
Imports
- XGBClassifier
from xgboost import XGBClassifier
- XGBRegressor
from xgboost import XGBRegressor
- DMatrix
from xgboost import DMatrix
- train
from xgboost import train
Quickstart
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
import os
# 1. Generate synthetic data for a classification task
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Initialize the XGBoost classifier
# 'objective' specifies the learning task.
# 'eval_metric' defines the metric for evaluation during training.
# `n_jobs` can be set to -1 to use all available CPU cores, or a specific number.
model = xgb.XGBClassifier(
objective='binary:logistic',
eval_metric='logloss',
n_estimators=100,
learning_rate=0.1,
max_depth=5,
n_jobs=int(os.environ.get('XGB_N_JOBS', '-1')), # Example of using env var for N_JOBS
random_state=42
)
# 3. Train the model
model.fit(X_train, y_train)
# 4. Make predictions on the test set
y_pred = model.predict(X_test)
# 5. Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")