CatBoost

1.2.10 · active · verified Thu Apr 09

CatBoost is a high-performance open-source gradient boosting library developed by Yandex. It handles categorical features natively: it encodes them with ordered target statistics and trains with ordered boosting, which together reduce target leakage and overfitting. It also offers fast training and prediction, GPU support, and built-in visualization tools. The current version is 1.2.10, with frequent minor releases that often include performance improvements and new features.
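To make the categorical encoding idea concrete, here is a minimal pure-Python sketch of an ordered target statistic. It is a simplified illustration, not CatBoost's actual implementation: the function name `ordered_target_statistics`, the `prior_weight` parameter, and the choice of the global target mean as the prior are all assumptions made for this example. Each row is encoded from the targets of rows that come before it in a random permutation, so its own label never contributes to its own encoding.

```python
import random
from collections import defaultdict


def ordered_target_statistics(categories, targets, prior_weight=1.0, seed=0):
    """Encode each row's category from the targets of earlier rows only.

    Hypothetical helper for illustration; CatBoost's internals differ.
    """
    n = len(categories)
    prior = sum(targets) / n  # assumption: global target mean as the prior
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of the rows

    sums = defaultdict(float)  # running target sum per category
    counts = defaultdict(int)  # running row count per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        # Only rows already visited contribute, so the row's own target
        # never leaks into its encoded value.
        encoded[idx] = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        sums[cat] += targets[idx]
        counts[cat] += 1
    return encoded


colors = ["red", "blue", "red", "red", "blue", "green"]
labels = [1, 0, 1, 0, 1, 1]
encoded_colors = ordered_target_statistics(colors, labels)
for color, label, value in zip(colors, labels, encoded_colors):
    print(f"{color:>5}  y={label}  encoded={value:.3f}")
```

A category seen for the first time (such as `green` here, which occurs once) falls back to the prior, while repeated categories drift toward the mean target of their earlier occurrences. A plain per-category target mean would instead include each row's own label, which is the leakage the ordered scheme is meant to avoid.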

Warnings

Install

Imports

Quickstart

This example trains a `CatBoostClassifier` on a synthetic dataset, declaring the categorical columns through a `Pool` object. It then predicts class labels and class probabilities for the test set and prints the best validation accuracy.

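import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Treat the first two features as categorical. CatBoost rejects floating-point
# values in categorical columns, so bin them into quartiles stored as strings.
categorical_features_indices = [0, 1]
for name in X.columns[categorical_features_indices]:
    X[name] = pd.qcut(X[name], q=4, labels=False).astype(str)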

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    eval_metric='Accuracy',
    random_seed=42,
    verbose=False # Suppress training output for quickstart
)

# Create CatBoost Pool objects, explicitly specifying categorical features
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)

# Train the model
model.fit(train_pool, eval_set=test_pool)

# Make predictions
# Reuse the test Pool so the categorical columns are declared the same way as in training
predictions = model.predict(test_pool)
probabilities = model.predict_proba(test_pool)

print(f"Model accuracy on test set: {model.get_best_score()['validation']['Accuracy']:.4f}")
print(f"First 5 predictions: {predictions[:5].tolist()}")
print(f"First 5 probabilities (class 0, class 1): {probabilities[:5].tolist()}")
