YDF (Yggdrasil Decision Forests)
YDF (short for Yggdrasil Decision Forests) is a library for training, serving, evaluating, and analyzing decision forest models such as Random Forest and Gradient Boosted Trees. It acts as a lightweight, efficient wrapper around the C++ Yggdrasil Decision Forests library. YDF is the official successor to TensorFlow Decision Forests (TF-DF) and is recommended for new projects due to its superior performance and features. It is actively developed with frequent releases.
Common errors
-
ModuleNotFoundError: No module named 'ydf'
cause The YDF library is not installed in the current Python environment.
fix Run `pip install ydf` to install the library.
-
TypeError: 'str' object cannot be interpreted as an integer (or similar data-related TypeErrors/ValueErrors during training)
cause The input data (e.g., a Pandas DataFrame or CSV file) contains columns whose data types YDF cannot interpret automatically or use for the specified task, or the `label` column is missing or incorrectly specified.
fix Review the column data types in your dataset and make sure they suit the features you intend to use. Verify that the `label` parameter of your learner (e.g., `ydf.GradientBoostedTreesLearner(label="your_target_column")`) points to an existing column whose values suit the task (classification, regression, etc.). YDF handles missing values automatically, but specific preprocessing may still help for certain data types or tasks.
-
AttributeError: 'Model' object has no attribute 'to_tensorflow_saved_model'
cause The code uses an outdated or incorrect method to export a YDF model to the TensorFlow SavedModel format, likely carried over from an older TF-DF pattern, without the `ydf-tf` package installed or correctly imported.
fix Install the `ydf-tf` package (`pip install ydf-tf`) and consult the latest YDF documentation for the correct way to export models to the TensorFlow SavedModel format using the `ydf-tf` integration. The `mode="keras"` option for direct export is deprecated.
-
RuntimeError: Learner 'GRADIENT_BOOSTED_TREES' requires at least one feature. Check 'exclude_non_specified_features' and 'features' arguments.
cause No valid input features were provided or automatically detected for training after excluding the label column and any manually excluded features.
fix Ensure the training dataset contains columns other than the label that YDF can use as features. If you define features explicitly, double-check the `features` argument passed to the learner. YDF usually auto-detects features, so this error often indicates that the feature set is empty after exclusions.
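The two data-related errors above can often be caught before training with a quick look at the column names. The following is a minimal sketch, not YDF's own validation: `check_training_columns` is a hypothetical helper that works on a plain list of column names (for a Pandas DataFrame, `list(train_ds.columns)`), and it assumes every non-label column is a candidate feature when `features` is not given.

```python
from typing import List, Optional, Sequence


def check_training_columns(
    columns: Sequence[str],
    label: str,
    features: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the usable feature columns, or raise ValueError with a hint.

    Hypothetical pre-flight helper; it is not part of the YDF API.
    """
    if label not in columns:
        raise ValueError(f"Label column {label!r} not found in {list(columns)}")
    if features is None:
        # Assumption: every non-label column is a candidate feature.
        candidates = [c for c in columns if c != label]
    else:
        missing = [f for f in features if f not in columns]
        if missing:
            raise ValueError(f"Requested features not in dataset: {missing}")
        candidates = [f for f in features if f != label]
    if not candidates:
        raise ValueError("No feature columns left after excluding the label")
    return candidates
```

Calling it with `["age", "income"]` and `label="income"` returns `["age"]`, while a dataset that contains only the label column raises a `ValueError` before any learner is constructed.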
Warnings
- breaking The method `model.to_tensorflow_saved_model(mode="keras")` is strongly discouraged and will be removed in a future version. Exporting YDF models to TensorFlow SavedModel now primarily uses the separate `ydf-tf` package.
- breaking Support for Python 3.8 was removed, and the package moved to `manylinux_2_28`.
- gotcha YDF training is deterministic given identical inputs and an identical YDF version. However, adding new columns, reordering existing columns, or making slight changes to the input data can lead to a different model, because some training components are stochastic (e.g., feature sampling) and depend on how the pseudo-random number generator is initialized.
- gotcha The `verbose` parameter in learners (e.g., `GradientBoostedTreesLearner`) controls the amount of logging. The default (`verbose=1`) might produce extensive output in notebooks or consoles, potentially obscuring important information.
- deprecated The loss metric `LAMBDA_MART_NDCG5` has been renamed to `LAMBDA_MART_NDCG`.
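Since column order is one of the inputs that affects training, one way to guard against accidental reordering is to fix the order before every training run. The sketch below is only an illustration: `stable_column_order` is a hypothetical helper, not a YDF function, and it does nothing about the other causes listed above (new columns or changed data still yield a different model).

```python
from typing import List, Sequence


def stable_column_order(columns: Sequence[str], label: str) -> List[str]:
    """Return feature names sorted alphabetically, with the label last.

    Hypothetical helper; it is not part of the YDF API.
    """
    if label not in columns:
        raise ValueError(f"Label column {label!r} not found")
    features = sorted(c for c in columns if c != label)
    return features + [label]


# With a Pandas DataFrame `df`, the order could be applied as:
#   df = df[stable_column_order(list(df.columns), "income")]
```

Two datasets that hold the same columns in a different order then reach the learner in the same layout.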
Install
-
pip install ydf -U
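To confirm that the install landed in the interpreter you actually run, a short check with the standard library can help. This is a minimal sketch using only `importlib` and `sys`; the helper names `is_importable`, `installed_version`, and `report` are illustrative, and the Python 3.9 floor is inferred from the note above that Python 3.8 support was removed.

```python
import importlib.util
import sys
from importlib import metadata
from typing import Optional


def is_importable(module: str) -> bool:
    """Return True if `module` can be found by the current interpreter."""
    return importlib.util.find_spec(module) is not None


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report() -> None:
    """Print the interpreter path, Python version check, and ydf status."""
    print(f"Interpreter: {sys.executable}")
    # Assumption: Python 3.9 is the minimum, since 3.8 support was removed.
    print(f"Python >= 3.9: {sys.version_info >= (3, 9)}")
    print(f"ydf importable: {is_importable('ydf')}")
    print(f"ydf version: {installed_version('ydf')}")


if __name__ == "__main__":
    report()
```

If `ydf importable` prints `False` right after a successful `pip install`, `pip` and `python` most likely point to different environments; running `python -m pip install ydf -U` installs into the interpreter shown on the first line.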
Imports
- ydf
import ydf
- GradientBoostedTreesLearner
from ydf.learner import GradientBoostedTreesLearner
import ydf
model = ydf.GradientBoostedTreesLearner(...)
Quickstart
import sys
import ydf
import pandas as pd
# Load dataset with Pandas
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset/"
try:
    train_ds = pd.read_csv(f"{ds_path}adult_train.csv")
    test_ds = pd.read_csv(f"{ds_path}adult_test.csv")
except Exception as e:
    print(f"Could not load datasets: {e}. Ensure internet connection or provide local paths.")
    sys.exit(1)
# Train a Gradient Boosted Trees model
# 'label' is the target column for prediction.
# verbose=0 to suppress training logs for cleaner output, default is 1.
model = ydf.GradientBoostedTreesLearner(label="income", verbose=0).train(train_ds)
# Evaluate the model
print("Model Evaluation:")
print(model.evaluate(test_ds))
# Generate predictions
predictions = model.predict(test_ds)
print("\nFirst 5 predictions:")
# predict() returns a NumPy array, which has no .head() method, so slice it.
print(predictions[:5])
# Save and Load the model
model_path = "/tmp/my_ydf_model"
model.save(model_path)
loaded_model = ydf.load_model(model_path)
print(f"\nModel saved to '{model_path}' and reloaded successfully.")
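The quickstart prints raw prediction values. For a binary classification label such as `income`, these are understood here to be probabilities of the positive class, so turning them into class names takes one more step. The sketch below rests on that assumption; `probabilities_to_labels` is a hypothetical helper, and the class order (negative first, positive second) is assumed to match what `model.label_classes()` reports, which is worth checking against the YDF documentation for your version.

```python
from typing import Iterable, List, Sequence


def probabilities_to_labels(
    probabilities: Iterable[float],
    classes: Sequence[str],
    threshold: float = 0.5,
) -> List[str]:
    """Map positive-class probabilities to class names.

    Hypothetical helper; assumes `classes` is (negative, positive).
    """
    if len(classes) != 2:
        raise ValueError("This sketch only covers binary classification")
    negative, positive = classes
    return [positive if p >= threshold else negative for p in probabilities]


# Assumed usage with the quickstart model (not executed here):
#   labels = probabilities_to_labels(model.predict(test_ds), model.label_classes())
```

The threshold of 0.5 is a neutral default; a different value trades precision against recall and should be chosen from the evaluation results.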