CatBoost
CatBoost is a high-performance open-source gradient boosting library developed by Yandex. It is best known for handling categorical features natively: it encodes them with ordered target statistics and trains with ordered boosting, which together reduce target leakage and overfitting. It also offers fast training and prediction, GPU support, and built-in visualization tools. The current version is 1.2.10; minor releases are frequent and often bring performance improvements and new features.
Warnings
- breaking Support for Python 3.7 and Apache Spark 2.x was officially dropped in CatBoost version 1.2.8. Users relying on these older environments will need to upgrade their Python or Spark, or pin an older CatBoost version.
- gotcha For binary classification tasks, `model.predict_proba()` returns a 2D array of shape `(n_samples, 2)`, where the columns hold the probabilities for class 0 and class 1 respectively. Users commonly expect a 1D array of positive-class probabilities; to get one, take the second column (`[:, 1]`).
- gotcha There was a period of flux regarding `numpy` 2.x compatibility. While `catboost>=1.2.8` now officially supports `numpy 2.x`, earlier versions (e.g., 1.2.6) explicitly prohibited it, which could cause dependency conflicts or runtime issues if `numpy` was upgraded prematurely.
- gotcha CatBoost's strength lies in its handling of categorical features, but they must be explicitly identified (e.g., using the `cat_features` parameter in `Pool` or `fit`). Integer-encoded columns that are not declared are silently treated as numerical, which can lead to suboptimal model performance, and undeclared string-valued columns generally cause an error. Categorical values must be integers or strings; float columns cannot be declared as categorical.
Install
pip install catboost
Imports
- CatBoostClassifier
from catboost import CatBoostClassifier
- CatBoostRegressor
from catboost import CatBoostRegressor
- Pool
from catboost import Pool
Quickstart
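import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
# Let's pretend the first two features are categorical. make_classification
# returns floats, which CatBoost rejects as categorical values, so bin these
# two columns into integer codes first.
categorical_features_indices = [0, 1]
for name in X.columns[categorical_features_indices]:
    X[name] = pd.cut(X[name], bins=4, labels=False).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)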
# Initialize CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    eval_metric='Accuracy',
    random_seed=42,
    verbose=False  # Suppress training output for quickstart
)
# Create CatBoost Pool objects, explicitly specifying categorical features
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)
# Train the model
model.fit(train_pool, eval_set=test_pool)
# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print(f"Model accuracy on test set: {model.get_best_score()['validation']['Accuracy']:.4f}")
print(f"First 5 predictions: {predictions[:5].tolist()}")
print(f"First 5 probabilities (class 0, class 1): {probabilities[:5].tolist()}")