{"id":5098,"library":"xgboost-ray","title":"XGBoost-Ray","description":"XGBoost-Ray provides a Ray backend for distributed XGBoost, enabling training and prediction on Ray clusters with minimal code changes. It extends the core XGBoost API to leverage distributed data representations and integrates seamlessly with other Ray libraries like Ray Tune for hyperparameter optimization and Ray Train for scalable ML workloads. The library is actively maintained, with frequent updates to ensure compatibility with recent XGBoost and Ray versions.","status":"active","version":"0.1.19","language":"en","source_language":"en","source_url":"https://github.com/ray-project/xgboost_ray","tags":["machine learning","distributed computing","xgboost","ray","gradient boosting","scalability"],"install":[{"cmd":"pip install xgboost-ray","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core machine learning library for gradient boosting. xgboost-ray acts as a distributed backend for it, with strong version coupling.","package":"xgboost","optional":false},{"reason":"Distributed execution framework that xgboost-ray uses to scale training and prediction across clusters.","package":"ray","optional":false},{"reason":"Used in common examples for loading datasets; not a direct runtime dependency for core xgboost-ray functionality.","package":"scikit-learn","optional":true}],"imports":[{"note":"xgboost_ray.RayDMatrix is required for data handling in a Ray distributed environment, replacing the native xgboost.DMatrix.","wrong":"from xgboost import DMatrix","symbol":"RayDMatrix","correct":"from xgboost_ray import RayDMatrix"},{"note":"Used to configure Ray-specific distributed training parameters, such as the number of actors.","symbol":"RayParams","correct":"from xgboost_ray import RayParams"},{"note":"The distributed training function provided by xgboost-ray, a drop-in replacement for xgboost.train.","symbol":"train","correct":"from xgboost_ray import train"},{"note":"The 
distributed prediction function for models trained with xgboost-ray.","symbol":"predict","correct":"from xgboost_ray import predict"}],"quickstart":{"code":"import ray\nfrom xgboost_ray import RayDMatrix, RayParams, train, predict\nfrom sklearn.datasets import load_breast_cancer\n\n# Start from a fresh local Ray instance (shut down any existing one first)\nif ray.is_initialized():\n    ray.shutdown()\nray.init(log_to_driver=False)  # log_to_driver=False suppresses verbose Ray worker logs in this quickstart\n\n# Load data\ntrain_x, train_y = load_breast_cancer(return_X_y=True)\n\n# Create RayDMatrix for distributed data handling\ntrain_set = RayDMatrix(train_x, train_y)\n\n# Configure Ray-specific training parameters\nray_params = RayParams(\n    num_actors=2,\n    cpus_per_actor=1\n)\n\n# Train the model using the xgboost-ray distributed train function\nevals_result = {}\nbst = train(\n    {\n        \"objective\": \"binary:logistic\",\n        \"eval_metric\": [\"logloss\", \"error\"],\n    },\n    train_set,\n    evals_result=evals_result,\n    evals=[(train_set, \"train\")],\n    verbose_eval=False,\n    ray_params=ray_params\n)\n\nprint(f\"Final training error: {evals_result['train']['error'][-1]:.4f}\")\n\n# Run distributed prediction with the trained model\npred = predict(bst, RayDMatrix(train_x), ray_params=ray_params)\nprint(f\"First prediction: {pred[0]:.4f}\")\n\n# Shutdown Ray cluster\nray.shutdown()","lang":"python","description":"This quickstart demonstrates how to train a distributed XGBoost model using `xgboost-ray`. It initializes a local Ray cluster, loads a sample dataset, prepares it as a `RayDMatrix`, configures `RayParams` for distributed execution, calls `xgboost_ray.train`, and finally runs distributed inference with `xgboost_ray.predict` before shutting the cluster down."},"warnings":[{"fix":"Ensure your `xgboost` version is compatible with `xgboost-ray` (typically XGBoost 2.0+ for recent `xgboost-ray` versions). Review XGBoost 2.0 release notes for core API changes.","message":"XGBoost-Ray v0.1.19 updated its API to work with XGBoost 2.0. 
This may introduce breaking changes if you are upgrading `xgboost-ray` and relying on specific API behaviors from older XGBoost versions.","severity":"breaking","affected_versions":">=0.1.19"},{"fix":"Upgrade your `xgboost` package to version 1.7.0 or newer when using `xgboost-ray` versions 0.1.12 and above.","message":"XGBoost-Ray v0.1.12 introduced compatibility for `xgboost>=1.7.0`. Using `xgboost-ray` with `xgboost` versions older than 1.7.0 may lead to unexpected behavior or errors.","severity":"breaking","affected_versions":">=0.1.12"},{"fix":"Change your import and data matrix creation from `xgb.DMatrix(...)` to `RayDMatrix(...)`.","message":"Always use `xgboost_ray.RayDMatrix` instead of `xgboost.DMatrix` when passing data to `xgboost_ray.train`. `RayDMatrix` is essential for distributed data handling and sharding across Ray actors.","severity":"gotcha","affected_versions":"All"},{"fix":"Set `RayParams(num_actors=..., cpus_per_actor=...)` based on your cluster resources and workload. Consider reserving some CPU for Ray Data if performing heavy data operations.","message":"Explicitly configure `num_actors` and `cpus_per_actor` in `RayParams`. While `xgboost-ray` attempts to auto-configure, manual specification is crucial for optimal performance, especially in heterogeneous clusters or multi-GPU setups. Ensure enough CPUs are available for Ray Data operations if used.","severity":"gotcha","affected_versions":"All"},{"fix":"Pass `distributed=False` to `RayDMatrix` (e.g., `RayDMatrix(data, labels, distributed=False)`) to have the head node shard the data into the Ray object store.","message":"When using `RayDMatrix` with data sources that cannot be naturally sharded (e.g., a single large Parquet file), you may encounter a `RuntimeError` about insufficient shards. 
In such cases, enable centralized loading.","severity":"gotcha","affected_versions":"All"},{"fix":"Either specify `object_store_memory` in `ray.init()` to limit the size, or set the environment variable `RAY_ENABLE_MAC_LARGE_OBJECT_STORE=1` to ignore the warning (use with caution).","message":"On macOS, Ray's performance can degrade if the object store size exceeds 2.0GB. This may lead to warnings or slower execution for large datasets.","severity":"gotcha","affected_versions":"All (macOS users)"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}