{"id":7416,"library":"ml-goodput-measurement","title":"ML Goodput Measurement","description":"ML Goodput Measurement (ml-goodput-measurement) is a Python library for monitoring and analyzing the efficiency of machine learning (ML) workloads. It tracks Goodput, Badput, and step time deviation, and integrates with Google Cloud Logging, Google Cloud Monitoring, and TensorBoard for data storage, visualization, and alerting. The library is actively maintained, with releases roughly monthly or bi-monthly, and is currently at version 0.0.16.","status":"active","version":"0.0.16","language":"en","source_language":"en","source_url":"https://github.com/AI-Hypercomputer/ml-goodput-measurement","tags":["ml","monitoring","goodput","badput","google-cloud","performance","tpu"],"install":[{"cmd":"pip install ml-goodput-measurement","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for logging productive step time and total job run time to Google Cloud Logging.","package":"google-cloud-logging","optional":true},{"reason":"Required for uploading Goodput and Badput metrics to TensorBoard.","package":"tensorflow","optional":true},{"reason":"Required for automatically sending performance data (cumulative/rolling window Goodput/Badput) to Google Cloud Monitoring.","package":"google-cloud-monitoring","optional":true}],"imports":[{"note":"Primary class for asynchronously querying and uploading ML Goodput metrics. Accessed as `monitoring.GoodputMonitor`.","symbol":"GoodputMonitor","correct":"from ml_goodput_measurement import monitoring"},{"note":"Class that exposes APIs to export key timestamps (e.g., step start time) to Cloud Logging. Accessed as `goodput.GoodputRecorder`.","symbol":"GoodputRecorder","correct":"from ml_goodput_measurement import goodput"},{"note":"Class for computing Goodput from recorded data, often used in analysis programs. Accessed as `goodput.GoodputCalculator`.","symbol":"GoodputCalculator","correct":"from ml_goodput_measurement import goodput"}],"quickstart":{"code":"import os\nimport time\n\nfrom ml_goodput_measurement import 
goodput\nfrom ml_goodput_measurement import monitoring\n\n# --- Configuration parameters (replace with your actual values) ---\n# IMPORTANT: Use a unique job_name and logger_name for each experiment to avoid mixing data between runs.\nJOB_NAME = os.environ.get('GOODPUT_JOB_NAME', 'my-ml-training-job-unique-id')\nLOGGER_NAME = os.environ.get('GOODPUT_LOGGER_NAME', f'goodput_{JOB_NAME}')\nTENSORBOARD_DIR = os.environ.get('GOODPUT_TENSORBOARD_DIR', '/tmp/tensorboard_logs')\n\n# --- Initialize the recorder, which writes key timestamps to Google Cloud Logging ---\nrecorder = goodput.GoodputRecorder(\n    job_name=JOB_NAME,\n    logger_name=LOGGER_NAME,\n    logging_enabled=True  # Set to False to disable writes to Google Cloud Logging\n)\n\n# --- Initialize and start GoodputMonitor so metrics upload while the job runs ---\n# The monitor runs in a separate process and uploads to TensorBoard and Google Cloud Monitoring.\nmonitor = monitoring.GoodputMonitor(\n    job_name=JOB_NAME,\n    logger_name=LOGGER_NAME,\n    tensorboard_dir=TENSORBOARD_DIR,\n    upload_interval=30,  # Upload every 30 seconds\n    monitoring_enabled=True,  # Set to False to disable the uploader\n    include_badput_breakdown=True\n)\nmonitor.start_goodput_uploader()\nprint(\"Goodput Monitor started. Metrics will be uploaded asynchronously.\")\n\n# --- Simulate a training loop ---\nprint(f\"Starting ML workload: {JOB_NAME}\")\nrecorder.record_job_start_time()\nfor step in range(10):\n    # Simulate data loading on some steps; this time counts as Badput\n    if step % 3 == 0:\n        recorder.record_data_loading_start_time()\n        time.sleep(0.5)\n        recorder.record_data_loading_end_time()\n    recorder.record_step_start_time(step)\n    time.sleep(1)  # Simulate productive work\n    print(f\"Completed productive step {step}\")\nrecorder.record_job_end_time()\n\n# For demonstration, keep the process alive long enough for at least one upload\ntime.sleep(65)\n\n# --- Stop the monitor when the job is done ---\nmonitor.stop_goodput_uploader()\nprint(\"ML workload finished. 
Goodput Monitor stopped.\")","lang":"python","description":"This quickstart demonstrates how to use `GoodputRecorder` to record training step timestamps and `GoodputMonitor` to asynchronously query and upload Goodput metrics. It simulates a basic ML training loop. Before running in a real environment, ensure you have a Google Cloud project with billing enabled and appropriate permissions for Cloud Logging, Cloud Monitoring, and GCS (for TensorBoard logs). For production use, set the `GOODPUT_JOB_NAME`, `GOODPUT_LOGGER_NAME`, and `GOODPUT_TENSORBOARD_DIR` environment variables to values unique to each run."},"warnings":[{"fix":"If your application relied on `GoodputMonitor` sharing memory or specific threading behavior with your main application, adjust your concurrency model. `multiprocessing` runs the uploader in a separate process with its own address space, so in-memory state is no longer shared.","message":"The `GoodputMonitor` implementation was refactored in `v0.0.15` to use `multiprocessing` instead of threads for asynchronous metric uploads.","severity":"breaking","affected_versions":">=0.0.15"},{"fix":"Always use a unique `job_name` and `logger_name` for each experiment or workload you intend to monitor separately, so that cumulative metrics cover only that workload.","message":"Reusing `job_name` or `logger_name` across different experiments or job runs within the same Google Cloud project can lead to inaccurate cumulative Goodput metrics.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Before deployment, ensure your GCP project has the required APIs enabled and that service accounts have permission to write to Cloud Logging, write custom metrics to Cloud Monitoring, and write to the specified TensorBoard GCS bucket.","message":"Full functionality relies on a properly configured Google Cloud project, including enabled billing and the necessary access scopes for Google Cloud Logging, Google Cloud Monitoring, and Google Cloud Storage (for TensorBoard).","severity":"gotcha","affected_versions":"All 
versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Assign a unique `job_name` and `logger_name` to every new experiment or training job to prevent data mixing and keep each workload's metrics isolated.","cause":"The `job_name` or `logger_name` was reused for a new experiment or job run, causing the system to aggregate data from multiple distinct workloads.","error":"Cumulative Goodput metrics appear incorrect or include data from previous runs."},{"fix":"Review your application's concurrency model. Ensure that any shared resources or state are handled appropriately for multiprocessing (e.g., using `multiprocessing.Queue` or `multiprocessing.Manager` objects for inter-process communication if necessary), or isolate the monitoring process fully.","cause":"The `GoodputMonitor`'s internal implementation switched from threads to `multiprocessing` in `v0.0.15`, which changes how it interacts with the parent process and system resources.","error":"GoodputMonitor is not uploading metrics or is throwing multiprocessing-related errors after upgrading."},{"fix":"Verify that your Google Cloud service account has the `roles/logging.logWriter` and `roles/monitoring.metricWriter` roles, and write access to the specified GCS bucket for TensorBoard logs. Double-check that the `logger_name` and `tensorboard_dir` parameters are correct and that the same `logger_name` is passed to both `GoodputRecorder` and `GoodputMonitor`.","cause":"This is typically due to missing Google Cloud permissions, an invalid `logger_name` or `tensorboard_dir`, or a network connectivity issue between the job and Google Cloud services.","error":"Metrics are not visible in Google Cloud Monitoring or TensorBoard despite `GoodputMonitor` running."}]}