ML Goodput Measurement

0.0.16 · active · verified Thu Apr 16

ML Goodput Measurement (ml-goodput-measurement) is a Python library for monitoring and analyzing the efficiency of machine learning (ML) workloads. It tracks Goodput, Badput, and step time deviation, and integrates with Google Cloud Logging, Google Cloud Monitoring, and TensorBoard for data storage, visualization, and alerting. The library is actively maintained, with minor releases roughly every one to two months; the current version is 0.0.16.
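
To make the three metrics concrete, here is a minimal sketch in plain Python that does not call the library. It assumes Goodput is the share of wall-clock time spent on productive training, Badput is the remainder, and step time deviation is measured against an ideal step time approximated by the median; the library may define the ideal step time differently. The function names and the sample numbers are hypothetical.

```python
from statistics import median


def goodput_percent(productive_seconds: float, total_seconds: float) -> float:
    """Share of wall-clock time spent on productive training, in percent."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * productive_seconds / total_seconds


def step_time_deviations(step_times: list[float]) -> list[float]:
    """Per-step deviation from an ideal step time (assumed here: the median)."""
    ideal = median(step_times)
    return [t - ideal for t in step_times]


# Hypothetical run: ten 1-second steps plus four 0.5-second data-loading stalls.
productive = 10 * 1.0
badput_by_cause = {"data_loading": 4 * 0.5}
total = productive + sum(badput_by_cause.values())

goodput = goodput_percent(productive, total)
badput = 100.0 - goodput
deviations = step_time_deviations([1.0, 1.0, 1.5, 1.0, 1.0])

print(f"Goodput: {goodput:.1f}%  Badput: {badput:.1f}%")
print(f"Step time deviations (s): {deviations}")
```

Keeping Badput broken down by cause, as in the `badput_by_cause` dictionary, is what makes it possible to see which kind of overhead dominates.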

Install

pip install ml-goodput-measurement

Imports

from ml_goodput_measurement import goodput, monitoring

Quickstart

This quickstart uses `GoodputRecorder` to record training progress and `GoodputMonitor` to query and upload Goodput metrics asynchronously, around a simulated ML training loop. Before running it in a real environment, make sure you have a Google Cloud project with billing enabled and permissions for Cloud Logging, Cloud Monitoring, and GCS (for TensorBoard logs). For production use, replace the placeholder defaults of the environment variables with unique values.

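import os
import time

# The PyPI package installs as ml_goodput_measurement; the classes live in its
# goodput and monitoring modules.
from ml_goodput_measurement import goodput, monitoring

GoodputRecorder = goodput.GoodputRecorder
GoodputMonitor = monitoring.GoodputMonitor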

# --- Configuration parameters (replace with your actual values) ---
# IMPORTANT: Use a unique JOB_NAME and LOGGER_NAME for each experiment to avoid data corruption.
JOB_NAME = os.environ.get('GOODPUT_JOB_NAME', 'my-ml-training-job-unique-id')
LOGGER_NAME = os.environ.get('GOODPUT_LOGGER_NAME', f'goodput_{JOB_NAME}')
TENSORBOARD_DIR = os.environ.get('GOODPUT_TENSORBOARD_DIR', '/tmp/tensorboard_logs')

# --- Initialize Recorder to log productive steps ---
recorder = GoodputRecorder(
    job_name=JOB_NAME,
    logger_name=LOGGER_NAME,
    logging_enabled=True  # Set to True to enable logging to Google Cloud Logging
)

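# --- Simulate a training loop ---
# The record_* method names below follow the library's README; check them
# against your installed version.
print(f"Starting ML workload: {JOB_NAME}")
recorder.record_job_start_time()
for step in range(10):
    recorder.record_step_start_time(step)  # Log the start time of this step
    time.sleep(1)  # Simulate productive work
    print(f"Completed productive step {step}")
    # Simulate data loading on every third step
    if step % 3 == 0:
        recorder.record_data_loading_start_time()
        time.sleep(0.5)
        recorder.record_data_loading_end_time()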

# --- Initialize and start GoodputMonitor for asynchronous upload ---
# The monitor runs in a separate process and uploads to TensorBoard and Google Cloud Monitoring.
monitor = GoodputMonitor(
    job_name=JOB_NAME,
    logger_name=LOGGER_NAME,
    tensorboard_dir=TENSORBOARD_DIR,
    upload_interval=30, # Upload every 30 seconds
    monitoring_enabled=True, # Set to True to enable monitoring
    include_badput_breakdown=True
)
monitor.start_goodput_uploader()

print("Goodput Monitor started. Metrics will be uploaded asynchronously.")
# Continue with the rest of your training job...
# For demonstration, keep the process alive long enough for two upload intervals.
time.sleep(65)

# --- Stop the monitor when the job is done ---
monitor.stop_goodput_uploader()
recorder.record_job_end_time()  # Log the end time of the job
print("ML workload finished. Goodput Monitor stopped.")
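
The quickstart asks for a unique job name and logger name per experiment. One way to get them is sketched below; `unique_job_name` is a hypothetical helper, not part of the library, and restricting names to lowercase letters, digits, and hyphens is a conservative assumption rather than a documented requirement.

```python
import re
import uuid
from datetime import datetime, timezone


def unique_job_name(prefix: str) -> str:
    """Return '<prefix>-<YYYYmmdd>-<HHMMSS>-<8 hex chars>' in lowercase."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random part guards against same-second launches
    safe_prefix = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-")
    return f"{safe_prefix}-{stamp}-{suffix}"


# Generate once per experiment, then export the values so that every host
# (and every restart of the same job) reads the same names.
job_name = unique_job_name("my-ml-training-job")
logger_name = f"goodput_{job_name}"
print(f"export GOODPUT_JOB_NAME={job_name}")
print(f"export GOODPUT_LOGGER_NAME={logger_name}")
```

Generating the name outside the training script and passing it in through `GOODPUT_JOB_NAME` and `GOODPUT_LOGGER_NAME` keeps it stable across hosts and restarts, which is probably what you want if a restarted job should continue the same Goodput record.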
