{"id":1411,"library":"catboost","title":"CatBoost","description":"CatBoost is a high-performance open-source gradient boosting library developed by Yandex. It stands out for its native handling of categorical features through ordered target statistics, combined with ordered boosting, which together help prevent target leakage and overfitting. It also offers fast training and prediction, GPU support, and built-in visualization tools. The current version is 1.2.10, with frequent minor releases that often include performance improvements and new features.","status":"active","version":"1.2.10","language":"en","source_language":"en","source_url":"https://github.com/catboost/catboost","tags":["machine learning","gradient boosting","classification","regression","categorical features","yandex","gpu"],"install":[{"cmd":"pip install catboost","lang":"bash","label":"CPU or GPU (if CUDA toolkit is installed)"}],"dependencies":[],"imports":[{"symbol":"CatBoostClassifier","correct":"from catboost import CatBoostClassifier"},{"symbol":"CatBoostRegressor","correct":"from catboost import CatBoostRegressor"},{"note":"The Pool object bundles features, labels, and categorical-feature metadata in one container. Use it with categorical features or large datasets so that data is loaded efficiently and features are interpreted correctly.","symbol":"Pool","correct":"from catboost import Pool"}],"quickstart":{"code":"from catboost import CatBoostClassifier, Pool\nimport pandas as pd\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification\n\n# Generate a synthetic dataset\nX, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)\nX = pd.DataFrame(X, columns=[f'f{i}' for i in range(X.shape[1])])\n\n# Treat the first two features as categorical. CatBoost rejects float-valued\n# categorical features, so bin them into integer codes first.\ncategorical_features_indices = [0, 1]\nfor col in X.columns[categorical_features_indices]:\n    X[col] = pd.qcut(X[col], q=4, labels=False).astype(int)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Initialize CatBoostClassifier\nmodel = CatBoostClassifier(\n    iterations=100,\n    learning_rate=0.1,\n    depth=6,\n    loss_function='Logloss',\n    
eval_metric='Accuracy',\n    random_seed=42,\n    verbose=False  # Suppress training output for quickstart\n)\n\n# Create CatBoost Pool objects, explicitly specifying categorical features\ntrain_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)\ntest_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)\n\n# Train the model, tracking the evaluation metric on the held-out pool\nmodel.fit(train_pool, eval_set=test_pool)\n\n# Make predictions on the same Pool so categorical features are interpreted consistently\npredictions = model.predict(test_pool)\nprobabilities = model.predict_proba(test_pool)\n\nprint(f\"Best accuracy on the evaluation set: {model.get_best_score()['validation']['Accuracy']:.4f}\")\nprint(f\"First 5 predictions: {predictions[:5].tolist()}\")\nprint(f\"First 5 probabilities (class 0, class 1): {probabilities[:5].tolist()}\")","lang":"python","description":"This example trains a `CatBoostClassifier` on a synthetic dataset, declaring the categorical features through `Pool` objects. It then predicts class labels and class probabilities for the held-out data and prints the best accuracy recorded on the evaluation set."},"warnings":[{"fix":"Upgrade to Python 3.8+ and, if you use Spark, to Spark 3.x. Alternatively, pin `catboost<1.2.8` in your project dependencies.","message":"Support for Python 3.7 and Apache Spark 2.x was officially dropped in CatBoost version 1.2.8. Users on these older environments must either upgrade them or pin an older CatBoost version.","severity":"breaking","affected_versions":">=1.2.8"},{"fix":"To get probabilities for the positive class (class 1), access the second column: `probabilities[:, 1]`.","message":"For binary classification tasks, `model.predict_proba()` returns a 2D array of shape `(n_samples, 2)`, where the columns hold the probabilities for class 0 and class 1 respectively. Users commonly expect a 1D array of positive-class probabilities.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure `catboost` is version `1.2.8` or newer for full `numpy 2.x` compatibility. 
If using an older `catboost` version, explicitly pin `numpy<2.0` in your environment.","message":"`numpy` 2.x compatibility depends on the CatBoost version. `catboost>=1.2.8` officially supports `numpy 2.x`, but earlier versions (e.g., 1.2.6) explicitly prohibited it, so upgrading `numpy` before `catboost` could cause dependency conflicts or runtime issues.","severity":"gotcha","affected_versions":"1.2.6 to 1.2.7 (partially); fixed in >=1.2.8"},{"fix":"When creating a `Pool` object or calling `model.fit()`, pass a list of indices or names of your categorical columns to the `cat_features` argument.","message":"CatBoost's strength lies in its handling of categorical features, but they must be explicitly identified (e.g., using the `cat_features` parameter in `Pool` or `fit`). Categorical columns that are not declared are treated as numerical, so integer-encoded categories are silently modeled as ordered numbers, which can lead to suboptimal model performance or incorrect results.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}