XGBoost-Ray
XGBoost-Ray provides a Ray backend for distributed XGBoost, enabling training and prediction on Ray clusters with minimal code changes. It extends the core XGBoost API to leverage distributed data representations and integrates seamlessly with other Ray libraries like Ray Tune for hyperparameter optimization and Ray Train for scalable ML workloads. The library is actively maintained, with frequent updates to ensure compatibility with recent XGBoost and Ray versions.
Warnings
- breaking XGBoost-Ray v0.1.19 updated its API to work with XGBoost 2.0. This may introduce breaking changes if you are upgrading `xgboost-ray` and relying on specific API behaviors from older XGBoost versions.
- breaking XGBoost-Ray v0.1.12 introduced compatibility for `xgboost>=1.7.0`. Using `xgboost-ray` with `xgboost` versions older than 1.7.0 may lead to unexpected behavior or errors.
- gotcha Always use `xgboost_ray.RayDMatrix` instead of `xgboost.DMatrix` when passing data to `xgboost_ray.train`. `RayDMatrix` is essential for distributed data handling and sharding across Ray actors.
- gotcha Explicitly configure `num_actors` and `cpus_per_actor` in `RayParams`. While `xgboost-ray` attempts to auto-configure, manual specification is crucial for optimal performance, especially in heterogeneous clusters or multi-GPU setups. Ensure enough CPUs are available for Ray Data operations if used.
- gotcha When using `RayDMatrix` with data sources that cannot be naturally sharded (e.g., a single large Parquet file), you may encounter a `RuntimeError` about insufficient shards. In such cases, enable centralized loading.
- gotcha On macOS, Ray's performance can degrade if the object store size exceeds 2.0GB. This may lead to warnings or slower execution for large datasets.
Install
-
pip install xgboost-ray
Imports
- RayDMatrix
from xgboost_ray import RayDMatrix
- RayParams
from xgboost_ray import RayParams
- train
from xgboost_ray import train
- predict
from xgboost_ray import predict
Quickstart
import ray
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
# Start Ray, restarting any existing session for a clean run
if ray.is_initialized():
    ray.shutdown()
ray.init(log_to_driver=False)  # log_to_driver=False suppresses verbose worker log output
# Load data
train_x, train_y = load_breast_cancer(return_X_y=True)
# Create RayDMatrix for distributed data handling
train_set = RayDMatrix(train_x, train_y)
# Configure Ray-specific training parameters
ray_params = RayParams(
    num_actors=2,
    cpus_per_actor=1
)
# Train the model using the xgboost-ray distributed train function
evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=ray_params
)
print(f"Final training error: {evals_result['train']['error'][-1]:.4f}")
# Shutdown Ray cluster
ray.shutdown()