XGBoost-Ray

0.1.19 · active · verified Sun Apr 12

XGBoost-Ray provides a Ray backend for distributed XGBoost, enabling training and prediction on Ray clusters with minimal code changes. It mirrors the core XGBoost API while handling distributed data loading through RayDMatrix, and it integrates with other Ray libraries such as Ray Tune for hyperparameter optimization and Ray Train for scalable ML workloads. The library is actively maintained, with regular updates for compatibility with recent XGBoost and Ray versions.

Warnings

Install
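
Install the latest release from PyPI:

pip install xgboost_ray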

Imports
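
The quickstart below uses the following top-level imports:

import ray
from xgboost_ray import RayDMatrix, RayParams, train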

Quickstart

This quickstart demonstrates how to train a distributed XGBoost model using `xgboost-ray`. It initializes a local Ray cluster, loads a sample dataset, prepares it as a `RayDMatrix`, configures `RayParams` for distributed execution, and then calls `xgboost_ray.train`.

import ray
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

# Start from a clean local Ray cluster
if ray.is_initialized():
    ray.shutdown()
ray.init(log_to_driver=False)  # log_to_driver=False suppresses verbose worker logs in this quickstart

# Load data
train_x, train_y = load_breast_cancer(return_X_y=True)

# Create RayDMatrix for distributed data handling
train_set = RayDMatrix(train_x, train_y)

# Configure Ray-specific training parameters
ray_params = RayParams(
    num_actors=2,      # number of distributed training actors
    cpus_per_actor=1   # CPUs reserved for each actor
)

# Train the model using the xgboost-ray distributed train function
evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=ray_params
)

print(f"Final training error: {evals_result['train']['error'][-1]:.4f}")

# Shutdown Ray cluster
ray.shutdown()
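
Prediction is distributed the same way via xgboost_ray.predict. A minimal sketch, assuming the quickstart's bst, train_x, and ray_params are still in scope and the cluster has not yet been shut down (i.e. run this before the final ray.shutdown()):

from xgboost_ray import predict

pred_set = RayDMatrix(train_x)  # label-free RayDMatrix for inference
pred = predict(bst, pred_set, ray_params=ray_params)
print(pred[:5])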

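The overview above mentions Ray Tune integration; here is a sketch of the usual pattern, assuming the quickstart's imports and dataset: wrap the train() call in a Tune trainable (xgboost-ray reports evaluation results to Tune automatically) and size each trial with RayParams.get_tune_resources(). The search space below is illustrative, not a recommendation:

from ray import tune

ray_params = RayParams(num_actors=2, cpus_per_actor=1)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)
    train(
        config,
        train_set,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params,
    )

config = {
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),  # illustrative range
    "max_depth": tune.randint(2, 8),     # illustrative range
}

analysis = tune.run(
    train_model,
    config=config,
    metric="train-error",  # "<eval name>-<metric>" as reported during training
    mode="min",
    num_samples=4,
    resources_per_trial=ray_params.get_tune_resources(),
)
print("Best hyperparameters:", analysis.best_config)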