Cloud Accelerator Diagnostics

0.1.1 · active · verified Thu Apr 16

Cloud Accelerator Diagnostics is a Python library for monitoring, debugging, and profiling workloads running on Cloud accelerators such as TPUs and GPUs. It provides a streamlined way to automatically upload diagnostic data to Vertex AI TensorBoard experiments on Google Cloud. The current release is 0.1.1; related tools such as `tpu-info` see more frequent updates, indicating active development in the ecosystem.

Common errors

Warnings

Install
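The library is published on PyPI under the name `cloud-accelerator-diagnostics`:

```shell
# Install from PyPI (use a virtual environment if preferred)
pip install cloud-accelerator-diagnostics
```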

Imports

Quickstart

This quickstart demonstrates how to use `cloud-accelerator-diagnostics` to upload logs to a Vertex AI TensorBoard instance. It covers starting and stopping the background upload thread and the Google Cloud setup the uploader requires (API enablement, IAM roles).
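Before running the quickstart, the Vertex AI API must be enabled in your project and the identity running the uploader needs a role allowed to write TensorBoard data. One possible setup with the `gcloud` CLI is sketched below; the project ID and service-account email are placeholders, and your project may need a different role depending on its IAM policy:

```shell
# Enable the Vertex AI API in the project (replace the project ID).
gcloud services enable aiplatform.googleapis.com --project=your-gcp-project-id

# Grant the identity that runs the uploader a role that can write to
# Vertex AI TensorBoard; the service-account email is a placeholder.
gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member="serviceAccount:my-sa@your-gcp-project-id.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"
```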

import os
import time

from cloud_accelerator_diagnostics import uploader

# Replace with your Google Cloud project ID and desired TensorBoard
# instance/experiment names.
PROJECT_ID = os.environ.get('GCP_PROJECT_ID', 'your-gcp-project-id')
LOCATION = os.environ.get('GCP_REGION', 'us-central1')  # e.g., 'us-central1'
TENSORBOARD_INSTANCE_NAME = 'test-instance'
EXPERIMENT_NAME = 'my-accelerator-experiment'
LOG_DIR = '/tmp/my_tpu_logs'  # Directory where your workload writes TensorBoard event files

# Ensure the log directory exists
os.makedirs(LOG_DIR, exist_ok=True)

print(f"Starting TensorBoard uploader for project {PROJECT_ID} in {LOCATION}...")
print(f"Logs from {LOG_DIR} will be uploaded to TensorBoard instance "
      f"'{TENSORBOARD_INSTANCE_NAME}', experiment '{EXPERIMENT_NAME}'.")

try:
    # Start the background thread that monitors LOG_DIR and uploads new
    # event files to Vertex AI TensorBoard. The instance and experiment
    # are created if they don't already exist.
    uploader.start_upload_to_tensorboard(
        project=PROJECT_ID,
        location=LOCATION,
        tensorboard_name=TENSORBOARD_INSTANCE_NAME,
        experiment_name=EXPERIMENT_NAME,
        logdir=LOG_DIR,
    )
    print("TensorBoard uploader started. Running for 60 seconds...")

    # Simulate a workload generating logs (e.g., a JAX/PyTorch training loop).
    # In a real scenario, your ML framework would write events to LOG_DIR.
    time.sleep(60)

    print("Workload simulation complete.")
finally:
    # Shut down the uploader thread gracefully, even if the workload failed.
    print("Stopping TensorBoard uploader...")
    uploader.stop_upload_to_tensorboard()
    print("TensorBoard uploader stopped.")
