Cloud TPU Diagnostics
The `cloud-tpu-diagnostics` library provides tools to monitor, debug, and profile jobs running on Cloud TPUs. It captures Python stack traces upon faults (e.g., segmentation faults, floating-point exceptions) and periodically collects traces to diagnose unresponsive or hung programs. Currently at version 0.1.5, it maintains an active release cadence with frequent updates focused on stability and minor feature enhancements.
Common errors
-
RuntimeError: Unable to initialize backend 'tpu': UNAVAILABLE: No TPU Platform available.
cause This error typically indicates that the Cloud TPU VM version is incorrect or incompatible with your JAX/TensorFlow runtime, or there's an underlying issue with the TPU hardware provisioning.fixVerify that you are running the correct and current JAX runtime version for your TPU VM. Check the Cloud TPU release notes and ensure your `gcloud` configuration specifies a compatible `--version` for your TPU type. -
ssh: connect to host X.X.X.X port 22: Connection timed out / ERROR: (gcloud.compute.tpus.tpu-vm.ssh) Could not SSH into the instance.
cause This is a common network connectivity issue preventing SSH access to your TPU VM, often due to firewall rules, incorrect project/zone settings, or an SSH key propagation problem.fixVerify your Google Cloud project, zone, and TPU VM name. Check that your VPC network's firewall rules allow SSH (TCP:22) access. Ensure your SSH keys are correctly propagated or try running the `gcloud compute tpus tpu-vm ssh` command again. -
Missing or incomplete stack traces in Cloud Logging or `/tmp/debugging` directory.
cause This can occur if `stack_trace_to_cloud` is `True` but the agent for uploading logs is not running, if `collect_stack_trace` is `False`, or if the diagnostic code is not properly integrated into your application.fixEnsure `stack_trace_config.collect_stack_trace` is set to `True`. If `stack_trace_to_cloud` is `True`, verify that the Cloud Logging agent is active and has the necessary permissions. Also, confirm that your application's `main` method is correctly wrapped by `diagnostic.diagnose()`. You can view logs in Logs Explorer using the query `logName="projects/<project_name>/logs/tpu.googleapis.com%2Fruntime_monitor" jsonPayload.verb="stacktraceanalyzer"`.
Warnings
- gotcha Older versions (pre-0.1.4) had issues with daemon threads and signal handling, potentially leading to graceful exit failures or unexpected behavior in multi-threaded programs. Ensure you are on the latest version for improved stability.
- gotcha In version 0.1.1 and earlier, specific exceptions like `AssertionError` or `tensorflow.python.framework.errors_impl.NotFoundError` might not have triggered stack trace dumping to the console as expected when `collect_stack_trace=True` and `stack_trace_to_cloud=False`.
- gotcha The `cloud-tpu-diagnostics` package must be installed on *all* TPU VMs and the diagnostic code integrated into *all* scripts running on those VMs to ensure comprehensive and accurate diagnostics. Incomplete deployment can lead to missing or partial trace data.
Install
-
pip install cloud-tpu-diagnostics
Imports
- diagnostic
from cloud_tpu_diagnostics import diagnostic
- debug_configuration
from cloud_tpu_diagnostics.configuration import debug_configuration
- diagnostic_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
- stack_trace_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
Quickstart
import os
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
def run_main_application():
print("Running main application logic...")
# Simulate an error or a long-running process
# raise ValueError("Simulated error for diagnostics")
print("Main application logic finished.")
stack_trace_config = stack_trace_configuration.StackTraceConfig(
collect_stack_trace=True,
stack_trace_to_cloud=True, # Set to False to dump to console
stack_trace_interval_seconds=300 # Collect every 5 minutes
)
debug_config = debug_configuration.DebugConfig(
stack_trace_config=stack_trace_config
)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
debug_config=debug_config
)
# Wrap your main method with diagnose() to periodically collect stack traces
with diagnostic.diagnose(diagnostic_config):
run_main_application()
print("Diagnostics agent has stopped.")