ML Diagnostics Python SDK

1.0.2 · active · verified Thu Apr 16

The `google-cloud-mldiagnostics` library is the Python SDK for Google Cloud's ML Diagnostics platform. It integrates with machine learning workloads to collect and manage workload metrics, configurations, and profiles, and it supports both programmatic and on-demand profile capture. With it you can create and monitor machine learning runs, deploy managed XProf resources for performance profiling, and visualize workload data on Google Cloud. The library is actively maintained and is updated as Google Cloud services add features.

Quickstart

This quickstart shows how to integrate `google-cloud-mldiagnostics` into a Python ML workload. It sets up Cloud Logging, creates a machine learning run, records sample metrics, and writes configuration data. Set the `GCP_PROJECT_ID` environment variable, or replace `'your-gcp-project-id'` with your Google Cloud project ID. Authentication typically relies on Application Default Credentials (for example, via `gcloud auth application-default login`).

```python
import os
import logging
import google.cloud.logging
from google_cloud_mldiagnostics import machinelearning_run
from google_cloud_mldiagnostics import metrics
from google_cloud_mldiagnostics import xprof
from google_cloud_mldiagnostics.proto.diagnostics import MetricType

# Set up Cloud Logging (recommended)
logging_client = google.cloud.logging.Client()
logging_client.setup_logging()
logging.info("Cloud Logging is set up.")

PROJECT_ID = os.environ.get('GCP_PROJECT_ID', 'your-gcp-project-id')
# Authentication: set GOOGLE_APPLICATION_CREDENTIALS, or log in with
# `gcloud auth application-default login`.

def run_ml_diagnostics_example():
    print(f"Using GCP Project ID: {PROJECT_ID}")

    # 1. Create a machine learning run
    # The SDK automatically generates a unique run_id if not provided.
    run_name = machinelearning_run.create_run(
        project_id=PROJECT_ID,
        experiment_name="my-first-experiment",
        display_name="my-training-run"
    )
    print(f"Created ML Run: {run_name}")

    # 2. Record metrics
    metrics.record(MetricType.LOSS, 0.5, step=1, run_name=run_name)
    metrics.record(MetricType.ACCURACY, 0.8, step=1, run_name=run_name)
    print("Recorded initial metrics.")

    metrics.record(MetricType.LOSS, 0.2, step=10, run_name=run_name)
    metrics.record(MetricType.ACCURACY, 0.95, step=10, run_name=run_name)
    print("Recorded updated metrics.")

    # 3. Write configurations (example)
    machinelearning_run.write_config(run_name, {"learning_rate": 0.01, "batch_size": 32})
    print("Wrote run configurations.")

    # Example of capturing a profile (requires XProf server running in your workload)
    # For on-demand capture, ensure xprof.start_server() is called in your ML workload.
    # xprof.capture_profile(run_name, 'gs://your-bucket/profiles', duration_ms=10000)
    # print("Attempted to capture profile.")

    print("ML Diagnostics example completed. Check Google Cloud Console for 'my-training-run'.")

if __name__ == '__main__':
    run_ml_diagnostics_example()
```
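
The quickstart silently falls back to the literal `'your-gcp-project-id'` when `GCP_PROJECT_ID` is unset, so a misconfigured environment would only surface once a call reaches Google Cloud. The following is a minimal sketch of a stricter lookup that fails early instead. The helper name `resolve_project_id` and the `PLACEHOLDER` constant are hypothetical and not part of the SDK; the sketch uses only the standard library.

```python
import os

# Hypothetical constant: the literal fallback string used in the quickstart.
PLACEHOLDER = "your-gcp-project-id"


def resolve_project_id(env=None):
    """Return the project ID from GCP_PROJECT_ID, or raise ValueError.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    project_id = env.get("GCP_PROJECT_ID", "").strip()
    if not project_id or project_id == PLACEHOLDER:
        raise ValueError(
            "GCP_PROJECT_ID is unset or still the placeholder; "
            "export it before running the quickstart."
        )
    return project_id


if __name__ == "__main__":
    # Demonstration with an explicit mapping instead of the real environment.
    print(resolve_project_id({"GCP_PROJECT_ID": "demo-project"}))
```

In the quickstart, `PROJECT_ID = resolve_project_id()` could stand in for the `os.environ.get(...)` line. If the `google-auth` package is installed, `google.auth.default()` returns a `(credentials, project_id)` pair and could serve as a further fallback when the variable is unset.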
