{"id":14490,"library":"cloud-accelerator-diagnostics","title":"Cloud Accelerator Diagnostics","description":"Cloud Accelerator Diagnostics is a Python library designed to monitor, debug, and profile workloads running on Cloud accelerators such as TPUs and GPUs. It provides a streamlined approach for automatically uploading diagnostic data to Tensorboard Experiments within Google Cloud's Vertex AI platform. The current version is 0.1.1, with related tools like `tpu-info` seeing more frequent updates, indicating active development in the ecosystem.","status":"active","version":"0.1.1","language":"en","source_language":"en","source_url":"https://github.com/AI-Hypercomputer/cloud-accelerator-diagnostics","tags":["cloud","tpu","gpu","diagnostics","monitoring","profiling","tensorboard","vertex-ai","google-cloud"],"install":[{"cmd":"pip install cloud-accelerator-diagnostics","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for integrating with Vertex AI Tensorboard.","package":"google-cloud-aiplatform","optional":false},{"reason":"Underlies TPU utilization metrics, used by related diagnostic tools like `tpu-info` which can be installed from a subdirectory of the cloud-accelerator-diagnostics repository.","package":"libtpu","optional":true}],"imports":[{"note":"Main function for initiating Tensorboard uploads.","symbol":"start_upload_to_tensorboard","correct":"from cloud_accelerator_diagnostics.upload_to_tensorboard import start_upload_to_tensorboard"},{"note":"Function to gracefully shut down the Tensorboard upload thread.","symbol":"stop_upload_to_tensorboard","correct":"from cloud_accelerator_diagnostics.upload_to_tensorboard import stop_upload_to_tensorboard"}],"quickstart":{"code":"import os\nimport time\nfrom cloud_accelerator_diagnostics.upload_to_tensorboard import start_upload_to_tensorboard, stop_upload_to_tensorboard\n\n# Replace with your Google Cloud Project ID and desired Tensorboard instance/experiment names\nPROJECT_ID = 
os.environ.get('GCP_PROJECT_ID', 'your-gcp-project-id')\nREGION = os.environ.get('GCP_REGION', 'us-central1') # e.g., 'us-central1'\nTENSORBOARD_INSTANCE_NAME = 'test-instance'\nEXPERIMENT_NAME = 'my-accelerator-experiment'\nLOG_DIR = '/tmp/my_tpu_logs' # Directory where Tensorboard logs are written by your workload\n\n# Ensure the log directory exists\nos.makedirs(LOG_DIR, exist_ok=True)\n\nprint(f\"Starting Tensorboard uploader for project {PROJECT_ID} in region {REGION}...\")\nprint(f\"Logs from {LOG_DIR} will be uploaded to Tensorboard instance '{TENSORBOARD_INSTANCE_NAME}' and experiment '{EXPERIMENT_NAME}'.\")\n\n# Initialize the handle so the finally block is safe if startup fails\nuploader_thread_handle = None\n\ntry:\n    # Start the background thread to monitor log_dir and upload to Vertex AI Tensorboard.\n    # This will create the instance and experiment if they don't exist.\n    uploader_thread_handle = start_upload_to_tensorboard(\n        project_id=PROJECT_ID,\n        region=REGION,\n        tensorboard_instance_name=TENSORBOARD_INSTANCE_NAME,\n        experiment_name=EXPERIMENT_NAME,\n        logdir=LOG_DIR\n    )\n    print(\"Tensorboard uploader started. Running for 60 seconds...\")\n\n    # Simulate a workload generating logs (e.g., a JAX/PyTorch training loop)\n    # In a real scenario, your ML framework would write events to LOG_DIR.\n    # For this example, we'll just wait.\n    time.sleep(60)\n\n    print(\"Workload simulation complete.\")\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Shut down the uploader thread only if it was actually started\n    if uploader_thread_handle is not None:\n        print(\"Stopping Tensorboard uploader...\")\n        stop_upload_to_tensorboard(uploader_thread_handle)\n        print(\"Tensorboard uploader stopped.\")\n","lang":"python","description":"This quickstart demonstrates how to initialize and use `cloud-accelerator-diagnostics` to upload logs to a Vertex AI Tensorboard instance. 
It covers starting and stopping the background upload thread. The required Google Cloud setup (API enablement, IAM roles) is not shown in the code and must be completed beforehand."},"warnings":[{"fix":"Enable the Vertex AI API in the Google Cloud Console and assign the 'Vertex AI User' role to your service account (e.g., `gcloud projects add-iam-policy-binding PROJECT_ID --member='serviceAccount:SERVICE_ACCOUNT_EMAIL' --role='roles/aiplatform.user'`).","message":"Before using the library with Vertex AI Tensorboard, ensure the Vertex AI API is enabled in your Google Cloud project and the service account executing the code has the 'Vertex AI User' IAM role. Failing to do so will result in permission errors.","severity":"gotcha","affected_versions":"All"},{"fix":"Enclose code using the uploader in a `try...finally` block, calling `stop_upload_to_tensorboard()` in the `finally` block, as shown in the quickstart example.","message":"`start_upload_to_tensorboard()` starts the upload in a separate background thread. Wrap any code that follows it and might raise an exception in a `try...finally` block so that `stop_upload_to_tensorboard()` is always called and the thread shuts down cleanly.","severity":"gotcha","affected_versions":"All"},{"fix":"Refer to the `AI-Hypercomputer/cloud-accelerator-diagnostics` repository for some related source components, but be aware the main PyPI package's internal structure might differ or lack explicit documentation in a public Google-owned GitHub repository.","message":"This `cloud-accelerator-diagnostics` PyPI package refers to a GitHub repository that is either private or not the primary source for the code. The `tpu-info` CLI tool (a related diagnostic utility) can be installed from a subdirectory of `AI-Hypercomputer/cloud-accelerator-diagnostics`, which appears to be the more actively maintained public repository for related tools. 
This discrepancy might lead to confusion when seeking source code or comprehensive documentation.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure the Vertex AI API is enabled and the executing principal has the 'Vertex AI User' (roles/aiplatform.user) IAM role on the Google Cloud project.","cause":"The service account or user credentials used to run the application do not have the necessary permissions (e.g., 'Vertex AI User' role) to create or access Vertex AI Tensorboard resources.","error":"google.api_core.exceptions.PermissionDenied: 403 Permission 'aiplatform.tensorboards.create' denied on resource"},{"fix":"Run `pip install cloud-accelerator-diagnostics` to install the package.","cause":"The `cloud-accelerator-diagnostics` package is not installed in the current Python environment.","error":"ModuleNotFoundError: No module named 'cloud_accelerator_diagnostics'"},{"fix":"Always capture the return value of `start_upload_to_tensorboard()` and pass it to `stop_upload_to_tensorboard()`. Ensure `stop_upload_to_tensorboard()` is called only once per `start_upload_to_tensorboard()` call.","cause":"`stop_upload_to_tensorboard()` was called on a thread handle whose thread was never started or has already stopped. More commonly, `start_upload_to_tensorboard()` was called without capturing its return value.","error":"Thread RuntimeError: cannot join current thread"}],"ecosystem":"pypi"}