Cloud Accelerator Diagnostics
Cloud Accelerator Diagnostics is a Python library for monitoring, debugging, and profiling workloads running on Cloud accelerators such as TPUs and GPUs. It provides a streamlined way to automatically upload diagnostic data to Tensorboard experiments in Google Cloud's Vertex AI platform. The current version is 0.1.1; related tools such as `tpu-info` see more frequent updates, indicating active development in the ecosystem.
Common errors
- `google.api_core.exceptions.PermissionDenied: 403 Permission 'aiplatform.tensorboards.create' denied on resource`
  - Cause: The service account or user credentials used to run the application lack the permissions (e.g., the 'Vertex AI User' role) needed to create or access Vertex AI Tensorboard resources.
  - Fix: Ensure the Vertex AI API is enabled and the executing principal has the 'Vertex AI User' (`roles/aiplatform.user`) IAM role on the Google Cloud project.
- `ModuleNotFoundError: No module named 'cloud_accelerator_diagnostics'`
  - Cause: The `cloud-accelerator-diagnostics` package is not installed in the current Python environment.
  - Fix: Run `pip install cloud-accelerator-diagnostics` to install the package.
- `RuntimeError: cannot join current thread`
  - Cause: Calling `stop_upload_to_tensorboard()` on a thread handle that was never properly started or has already stopped; most commonly, `start_upload_to_tensorboard()` was called without capturing its return value.
  - Fix: Always capture the return value of `start_upload_to_tensorboard()` and pass it to `stop_upload_to_tensorboard()`. Call `stop` exactly once per `start`.
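For the 403 permission error above, both prerequisites can be put in place with `gcloud`. In this sketch, `my-project` and the service-account email are placeholders; substitute your own values:

```shell
# Enable the Vertex AI API on the project (placeholder project ID).
gcloud services enable aiplatform.googleapis.com --project=my-project

# Grant the 'Vertex AI User' role to the principal that runs the workload
# (placeholder service-account email).
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
```

These commands configure cloud resources, so they require authenticated `gcloud` access and sufficient permissions on the project.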
Warnings
- Before using the library with Vertex AI Tensorboard, ensure the Vertex AI API is enabled in your Google Cloud project and the service account executing the code has the 'Vertex AI User' IAM role; otherwise calls will fail with permission errors.
- `start_upload_to_tensorboard()` runs the uploader in a separate thread. Wrap any subsequent code that might raise exceptions in a `try...finally` block so `stop_upload_to_tensorboard()` is always called and the thread shuts down cleanly.
- The `cloud-accelerator-diagnostics` PyPI package points to a GitHub repository that is either private or not the primary source for the code. The `tpu-info` CLI tool (a related diagnostic utility) can be installed from a subdirectory of `AI-Hypercomputer/cloud-accelerator-diagnostics`, which appears to be the more actively maintained public repository for related tools. This discrepancy can cause confusion when looking for source code or comprehensive documentation.
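The capture-the-handle discipline described above mirrors the standard stoppable-thread pattern in Python. The sketch below uses generic stand-in names (`start_worker`, `stop_worker`), not the library's internals, to show why the handle from `start` must survive until `stop`:

```python
import threading
import time

def start_worker():
    """Start a background worker; return (thread, stop_event) as the handle."""
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            # A real uploader would scan a log directory and push new events here.
            time.sleep(0.01)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop_event

def stop_worker(handle):
    """Signal the worker to stop and wait for it to exit."""
    thread, stop_event = handle
    stop_event.set()
    thread.join()

handle = start_worker()  # Always capture the handle...
try:
    time.sleep(0.05)     # ...do the real work...
finally:
    stop_worker(handle)  # ...and stop exactly once, even if the work raised.
```

Losing the handle leaves no way to signal or join the thread, which is the root of the `RuntimeError` listed under Common errors.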
Install
```shell
pip install cloud-accelerator-diagnostics
```
Imports
- `start_upload_to_tensorboard`:
  `from cloud_accelerator_diagnostics.upload_to_tensorboard import start_upload_to_tensorboard`
- `stop_upload_to_tensorboard`:
  `from cloud_accelerator_diagnostics.upload_to_tensorboard import stop_upload_to_tensorboard`
Quickstart
```python
import os
import time

from cloud_accelerator_diagnostics.upload_to_tensorboard import (
    start_upload_to_tensorboard,
    stop_upload_to_tensorboard,
)

# Replace with your Google Cloud project ID and desired Tensorboard
# instance/experiment names.
PROJECT_ID = os.environ.get('GCP_PROJECT_ID', 'your-gcp-project-id')
REGION = os.environ.get('GCP_REGION', 'us-central1')  # e.g., 'us-central1'
TENSORBOARD_INSTANCE_NAME = 'test-instance'
EXPERIMENT_NAME = 'my-accelerator-experiment'
LOG_DIR = '/tmp/my_tpu_logs'  # Directory where your workload writes Tensorboard logs

# Ensure the log directory exists.
os.makedirs(LOG_DIR, exist_ok=True)

print(f"Starting Tensorboard uploader for project {PROJECT_ID} in region {REGION}...")
print(f"Logs from {LOG_DIR} will be uploaded to Tensorboard instance "
      f"'{TENSORBOARD_INSTANCE_NAME}' and experiment '{EXPERIMENT_NAME}'.")

uploader_thread_handle = None
try:
    # Start the background thread that monitors LOG_DIR and uploads to Vertex AI
    # Tensorboard. This creates the instance and experiment if they don't exist.
    uploader_thread_handle = start_upload_to_tensorboard(
        project_id=PROJECT_ID,
        region=REGION,
        tensorboard_instance_name=TENSORBOARD_INSTANCE_NAME,
        experiment_name=EXPERIMENT_NAME,
        logdir=LOG_DIR,
    )
    print("Tensorboard uploader started. Running for 60 seconds...")

    # Simulate a workload generating logs (e.g., a JAX/PyTorch training loop).
    # In a real scenario, your ML framework would write events to LOG_DIR;
    # for this example, we just wait.
    time.sleep(60)
    print("Workload simulation complete.")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Gracefully shut down the uploader thread if it was started.
    if uploader_thread_handle is not None:
        print("Stopping Tensorboard uploader...")
        stop_upload_to_tensorboard(uploader_thread_handle)
        print("Tensorboard uploader stopped.")
```
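Because the start/stop pairing is easy to get wrong, it can be enforced with a context manager. The sketch below substitutes self-contained stub functions for the library's `start_upload_to_tensorboard`/`stop_upload_to_tensorboard` so it runs anywhere; the `tensorboard_uploader` helper is hypothetical, not part of the library, but with the real imports its body would be the same:

```python
import contextlib

# Stand-in stubs for the library functions, so this sketch is self-contained.
def start_upload_to_tensorboard(**kwargs):
    return {"running": True, **kwargs}  # stub handle

def stop_upload_to_tensorboard(handle):
    handle["running"] = False

@contextlib.contextmanager
def tensorboard_uploader(**kwargs):
    """Hypothetical helper: start the uploader, always stop it on exit."""
    handle = start_upload_to_tensorboard(**kwargs)
    try:
        yield handle
    finally:
        stop_upload_to_tensorboard(handle)

with tensorboard_uploader(logdir="/tmp/my_tpu_logs") as handle:
    # Run the workload here; the uploader stops even if this block raises.
    pass

print(handle["running"])  # False after the with-block exits
```

The `finally` clause inside the context manager guarantees shutdown on both normal exit and exceptions, which is exactly what the Quickstart's `try...finally` achieves by hand.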