CloudML HyperTune
Cloudml-hypertune is a lightweight Python library providing helper functions to report hyperparameter tuning metrics to Google Cloud's Vertex AI (formerly Cloud ML Engine). It enables the hyperparameter tuning service to track and optimize model training trials by collecting objective metrics. Although its latest release, `0.1.0.dev6`, dates to December 2019, it remains the standard way to report custom metrics for hyperparameter tuning on Google Cloud.
Common errors
- Hyperparameter tuning failed
  Cause: The hyperparameter tuning service could not retrieve the objective metric from the training job. This often happens due to a mismatch between the `hyperparameter_metric_tag` in your code and the metric configuration in Vertex AI.
  Fix: Verify that the `hyperparameter_metric_tag` used in `hpt.report_hyperparameter_tuning_metric()` exactly matches the `metric_id` specified in your Vertex AI hyperparameter tuning job configuration. Also ensure the `report_hyperparameter_tuning_metric` call is actually reached and executed during training.
- Trials show status 'Failed' but logs don't show Python errors
  Cause: The training code might be completing without explicitly reporting a metric, or the metric is being reported incorrectly. Older TensorFlow Estimator-based training might not output event files in a format recognized by the tuning service if `cloudml-hypertune` isn't used.
  Fix: Ensure `hpt.report_hyperparameter_tuning_metric` is called with a valid `metric_value` and `hyperparameter_metric_tag` at the end of each evaluation step or epoch. For non-TensorFlow models or custom evaluation loops, `cloudml-hypertune` is the definitive way to report.
- Google Cloud ML Engine does not return objective values when hyperparameter tuning
  Cause: The `cloudml-hypertune` library's `report_hyperparameter_tuning_metric` function was not invoked or did not successfully transmit the metric. This could be due to an uncaught exception in the training code preventing the call, or an incorrect metric tag.
  Fix: Debug your training script to ensure `hpt.report_hyperparameter_tuning_metric` is called. Add logging around this call to confirm its execution and the values being passed. Double-check the `hyperparameter_metric_tag` against your Vertex AI job configuration.
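One way to confirm the report call actually fires is to wrap it in a small logging helper. This is a minimal sketch: the `report_metric` helper name is our own, and the `ImportError` fallback exists only so the snippet also runs outside a Cloud training container.

```python
import logging

logging.basicConfig(level=logging.INFO)

def report_metric(tag, value, step):
    """Log the metric, then hand it to cloudml-hypertune if available.

    Returns True if the metric was handed to hypertune, False otherwise,
    so the training script can surface reporting failures in its logs.
    """
    logging.info("Reporting metric %s=%s at step %s", tag, value, step)
    try:
        import hypertune  # module installed by the cloudml-hypertune package
        hpt = hypertune.HyperTune()
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=tag,
            metric_value=value,
            global_step=step,
        )
        return True
    except ImportError:
        logging.warning("cloudml-hypertune not installed; metric not reported")
        return False
```

With this in place, the trial logs show exactly which tag and value were sent, which makes a tag mismatch against the job configuration easy to spot.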
Warnings
- gotcha: The `cloudml-hypertune` library is designed specifically to work within Google Cloud's hyperparameter tuning services (Vertex AI/AI Platform). It is not a standalone hyperparameter tuning framework and will not drive optimization outside that cloud environment.
- breaking: Hyperparameters you wish to tune *must* be exposed as command-line arguments in your training script. The Vertex AI tuning service passes trial-specific hyperparameter values via these arguments, not directly to the `hypertune` library.
- gotcha: The `hyperparameter_metric_tag` passed to `hpt.report_hyperparameter_tuning_metric` must exactly match the `metric_id` (or `hyperparameterMetricTag` in older configs) specified in your Vertex AI hyperparameter tuning job configuration. A mismatch will cause trials to fail to report metrics, or the tuning job will not recognize the objective.
- gotcha: Version `0.1.0.dev6` was released in December 2019. It is still functional and referenced in Google Cloud documentation for Vertex AI; the lack of recent updates can make the project look abandoned, but it is a stable, minimal utility. Compatibility with very new Python features is not guaranteed.
- gotcha: Training runs that produce `NaN` loss values or raise unhandled exceptions will cause hyperparameter tuning trials to fail. This not only wastes resources but also prevents the tuning algorithm from learning from that trial.
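To avoid losing a trial to a `NaN` metric, one option is to clamp non-finite values to a sentinel "worst" score before reporting. This is a sketch using only the standard library; the `safe_metric` name and the sentinel default are our own choices, not part of the library.

```python
import math

def safe_metric(value, worst=0.0):
    """Return the metric unchanged if finite; otherwise a sentinel worst score.

    Reporting a deliberately bad score keeps the trial alive and tells the
    tuning algorithm that this region of the search space performed poorly,
    instead of the trial simply failing and teaching it nothing.
    """
    return value if math.isfinite(value) else worst

print(safe_metric(0.91))          # healthy metric passes through unchanged
print(safe_metric(float("nan")))  # NaN clamped to the sentinel worst score
```

The clamped value would then be passed as `metric_value` to `report_hyperparameter_tuning_metric`. For a minimized objective (e.g. loss), pick a suitably large `worst` value instead.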
Install
- pip install cloudml-hypertune
Imports
- HyperTune (note: the package installs as `cloudml-hypertune`, but the importable module is named `hypertune`)

  from hypertune import HyperTune

  import hypertune
  hpt = hypertune.HyperTune()
Quickstart
import argparse

import hypertune


def train_model(learning_rate, num_epochs, metric_tag):
    # Simulate model training with hyperparameters.
    # In a real scenario, this would be your ML training loop.
    print(f"Training with learning_rate={learning_rate}, num_epochs={num_epochs}")

    # Simulate a metric, e.g., validation accuracy.
    # In a real scenario, you'd get this from your model's evaluation.
    metric_value = 0.5 + (learning_rate * 0.1) + (num_epochs * 0.01)

    # Report the metric to Vertex AI / AI Platform.
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=metric_tag,  # Must match the metric_id in the job config
        metric_value=metric_value,
        global_step=num_epochs,  # Or the current training step
    )
    print(f"Reported metric '{metric_tag}': {metric_value} at step {num_epochs}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters must be exposed as command-line arguments:
    # the tuning service passes trial-specific values via these flags.
    parser.add_argument(
        '--learning_rate',
        type=float,
        default=0.01,
        help='Learning rate for training.'
    )
    parser.add_argument(
        '--num_epochs',
        type=int,
        default=10,
        help='Number of epochs for training.'
    )
    parser.add_argument(
        '--metric_tag',
        type=str,
        default='accuracy',
        help='Tag for the metric reported to HyperTune.'
    )
    args = parser.parse_args()
    train_model(args.learning_rate, args.num_epochs, args.metric_tag)
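For the quickstart to report successfully, the tuning job's configuration must declare a metric whose `metric_id` matches the script's `--metric_tag` (here `accuracy`), and parameter IDs matching the argument names. Below is a sketch of such a config, assuming the YAML schema accepted by `gcloud ai hp-tuning-jobs create --config`; the machine type, trial counts, and packaging details are placeholders to adapt to your project.

```yaml
displayName: hypertune-quickstart
studySpec:
  metrics:
    - metricId: accuracy          # must match --metric_tag / hyperparameter_metric_tag
      goal: MAXIMIZE
  parameters:
    - parameterId: learning_rate  # passed to the script as --learning_rate
      doubleValueSpec:
        minValue: 0.001
        maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
    - parameterId: num_epochs     # passed to the script as --num_epochs
      integerValueSpec:
        minValue: 5
        maxValue: 20
maxTrialCount: 20
parallelTrialCount: 4
trialJobSpec:
  workerPoolSpecs:
    - machineSpec:
        machineType: n1-standard-4
      replicaCount: 1
      # containerSpec or pythonPackageSpec for your training code goes here
```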