ML Goodput Measurement
ML Goodput Measurement (ml-goodput-measurement) is a Python library for monitoring and analyzing the efficiency of Machine Learning (ML) workloads. It tracks metrics such as Goodput (the share of total job time spent on productive work), Badput (time lost to overheads such as initialization, data loading, and restarts), and step time deviation, and it integrates with Google Cloud Logging, Google Cloud Monitoring, and TensorBoard for data storage, visualization, and alerting. The library is actively maintained, with minor releases roughly every one to two months, and is currently at version 0.0.16.
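Conceptually, Goodput is productive time divided by total wall-clock time; everything else is Badput. A minimal sketch of the arithmetic (the `goodput_percent` helper below is illustrative only, not part of the library's API):

```python
def goodput_percent(productive_seconds: float, total_seconds: float) -> float:
    """Goodput = productive time / total wall-clock time, as a percentage."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * productive_seconds / total_seconds

# A job that spent 54 of 60 minutes on productive training steps:
print(goodput_percent(54 * 60, 60 * 60))  # → 90.0
```

The library derives the productive and unproductive intervals automatically from recorded step and job events, so you never compute this by hand; the point is only that every second not attributed to productive steps lowers the reported Goodput.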
Common errors
- Cumulative Goodput metrics appear incorrect or include data from previous runs.
  - Cause: the `job_name` or `logger_name` was reused for a new experiment or job run, so the system aggregates data from multiple distinct workloads.
  - Fix: assign a unique `job_name` and `logger_name` to every new experiment or training job so metric collection stays isolated and accurate.
- GoodputMonitor is not uploading metrics, or throws multiprocessing-related errors after an upgrade.
  - Cause: the `GoodputMonitor`'s internal implementation switched from multithreading to multiprocessing in `v0.0.15`, which changes how it interacts with the parent process and system resources.
  - Fix: review your application's concurrency model. Handle any shared resources or state appropriately for multiprocessing (e.g., use `multiprocessing.Queue` or `Manager` objects for inter-process communication if necessary), or fully isolate the monitoring process.
- Metrics are not visible in Google Cloud Monitoring or TensorBoard even though `GoodputMonitor` is running.
  - Cause: typically incorrect Google Cloud permissions, an invalid `logger_name` or `tensorboard_dir`, or a network connectivity issue with Google Cloud services.
  - Fix: verify that your Google Cloud service account has the `logging.logWriter` and `monitoring.metricWriter` roles plus write access to the GCS bucket backing the TensorBoard logs, and double-check the `logger_name` and `tensorboard_dir` parameters.
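One simple way to guarantee a unique `job_name` per run is to derive it from a UTC timestamp plus a short random suffix. This naming scheme is a suggestion, not a library requirement:

```python
import uuid
from datetime import datetime, timezone

def unique_job_name(prefix: str) -> str:
    """Build a job_name that is very unlikely to collide with earlier runs."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{prefix}-{stamp}-{uuid.uuid4().hex[:8]}"

job_name = unique_job_name("my-ml-training-job")
logger_name = f"goodput_{job_name}"
print(job_name)  # e.g. my-ml-training-job-20250101-120000-3f9a1c2e
```

Deriving `logger_name` from `job_name`, as above, keeps the two consistent so a fresh job never inherits another run's log entries.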
Warnings
- breaking The `GoodputMonitor` implementation was refactored in `v0.0.15` to use `multiprocessing` instead of `multithreading` for asynchronous metric uploads.
- gotcha Reusing `job_name` or `logger_name` across different experiments or job runs within the same Google Cloud project can lead to inaccurate cumulative Goodput metrics.
- gotcha Full functionality relies on a properly configured Google Cloud project, including enabled billing and necessary access scopes for Google Cloud Logging, Google Cloud Monitoring, and Google Cloud Storage (for TensorBoard).
Install
-
pip install ml-goodput-measurement
Imports
- GoodputMonitor
from ml_goodput_measurement import monitoring  # monitoring.GoodputMonitor
- GoodputRecorder
from ml_goodput_measurement import goodput  # goodput.GoodputRecorder
- GoodputCalculator
from ml_goodput_measurement import goodput  # goodput.GoodputCalculator
Quickstart
import os
import time

from ml_goodput_measurement import goodput, monitoring

# --- Configuration (replace with your actual values) ---
# IMPORTANT: use a unique job_name and logger_name for each experiment to
# avoid mixing cumulative Goodput data across runs.
JOB_NAME = os.environ.get('GOODPUT_JOB_NAME', 'my-ml-training-job-unique-id')
LOGGER_NAME = os.environ.get('GOODPUT_LOGGER_NAME', f'goodput_{JOB_NAME}')
TENSORBOARD_DIR = os.environ.get('GOODPUT_TENSORBOARD_DIR', '/tmp/tensorboard_logs')

# --- Recorder: writes timestamped job/step events to Google Cloud Logging ---
recorder = goodput.GoodputRecorder(
    job_name=JOB_NAME,
    logger_name=LOGGER_NAME,
    logging_enabled=True,  # in multi-host jobs, enable on a single process only
)

# --- Monitor: runs in a separate process and uploads metrics asynchronously
# to TensorBoard and Google Cloud Monitoring ---
monitor = monitoring.GoodputMonitor(
    job_name=JOB_NAME,
    logger_name=LOGGER_NAME,
    tensorboard_dir=TENSORBOARD_DIR,
    upload_interval=30,            # upload every 30 seconds
    monitoring_enabled=True,
    include_badput_breakdown=True,
)
monitor.start_goodput_uploader()
print("Goodput Monitor started. Metrics will be uploaded asynchronously.")

# --- Training loop: record each step's start time; productive and
# unproductive intervals are derived from these events ---
recorder.record_job_start_time()
print(f"Starting ML workload: {JOB_NAME}")
for step in range(10):
    recorder.record_step_start_time(step)
    time.sleep(1)  # simulate productive work

# --- Mark the job as finished and stop the uploader ---
recorder.record_job_end_time()
monitor.stop_goodput_uploader()
print("ML workload finished. Goodput Monitor stopped.")
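With `include_badput_breakdown=True`, the uploaded metrics split unproductive time into categories. The breakdown below is a hypothetical example, used only to show the arithmetic; real category names and percentages come from the library, not hard-coded values:

```python
# Hypothetical badput breakdown: category -> percentage of total job time.
# Actual categories and values are produced by the library at runtime.
badput_breakdown = {
    "TPU_INIT": 2.0,
    "TRAINING_PREP": 1.0,
    "DATA_LOADING": 4.0,
}

total_badput = sum(badput_breakdown.values())  # 7.0% of job time unproductive
goodput = 100.0 - total_badput                 # 93.0% productive

print(f"Goodput: {goodput:.1f}%")
for category, pct in sorted(badput_breakdown.items(), key=lambda kv: -kv[1]):
    print(f"  Badput[{category}]: {pct:.1f}%")
```

Summaries like this make it easy to see which category dominates the lost time (here, data loading) and therefore where optimization effort would raise Goodput the most.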