{"id":7086,"library":"cloud-tpu-diagnostics","title":"Cloud TPU Diagnostics","description":"The `cloud-tpu-diagnostics` library provides tools to monitor, debug, and profile jobs running on Cloud TPUs. It captures Python stack traces upon faults (e.g., segmentation faults, floating-point exceptions) and periodically collects traces to diagnose unresponsive or hung programs. Currently at version 0.1.5, it maintains an active release cadence with frequent updates focused on stability and minor feature enhancements.","status":"active","version":"0.1.5","language":"en","source_language":"en","source_url":"https://github.com/google/cloud-tpu-monitoring-debugging","tags":["cloud-tpu","diagnostics","debugging","profiling","google-cloud"],"install":[{"cmd":"pip install cloud-tpu-diagnostics","lang":"bash","label":"Install on Cloud TPU VM"}],"dependencies":[],"imports":[{"symbol":"diagnostic","correct":"from cloud_tpu_diagnostics import diagnostic"},{"symbol":"debug_configuration","correct":"from cloud_tpu_diagnostics.configuration import debug_configuration"},{"symbol":"diagnostic_configuration","correct":"from cloud_tpu_diagnostics.configuration import diagnostic_configuration"},{"symbol":"stack_trace_configuration","correct":"from cloud_tpu_diagnostics.configuration import stack_trace_configuration"}],"quickstart":{"code":"import os\nfrom cloud_tpu_diagnostics import diagnostic\nfrom cloud_tpu_diagnostics.configuration import debug_configuration\nfrom cloud_tpu_diagnostics.configuration import diagnostic_configuration\nfrom cloud_tpu_diagnostics.configuration import stack_trace_configuration\n\ndef run_main_application():\n    print(\"Running main application logic...\")\n    # Simulate an error or a long-running process\n    # raise ValueError(\"Simulated error for diagnostics\")\n    print(\"Main application logic finished.\")\n\nstack_trace_config = stack_trace_configuration.StackTraceConfig(\n    collect_stack_trace=True,\n    stack_trace_to_cloud=True,  # Set to 
False to dump to console\n    stack_trace_interval_seconds=300 # Collect every 5 minutes\n)\ndebug_config = debug_configuration.DebugConfig(\n    stack_trace_config=stack_trace_config\n)\ndiagnostic_config = diagnostic_configuration.DiagnosticConfig(\n    debug_config=debug_config\n)\n\n# Wrap your main method with diagnose() to periodically collect stack traces\nwith diagnostic.diagnose(diagnostic_config):\n    run_main_application()\n\nprint(\"Diagnostics agent has stopped.\")\n","lang":"python","description":"To use the diagnostics, install the package on all Cloud TPU VMs and wrap your main application logic in the `diagnostic.diagnose()` context manager. This example collects stack traces every 5 minutes and uploads them to Cloud Logging. Customize `collect_stack_trace` and `stack_trace_to_cloud` based on your debugging needs."},"warnings":[{"fix":"Upgrade to `cloud-tpu-diagnostics>=0.1.4` to benefit from fixes for graceful daemon thread exits and signal handling.","message":"Versions before 0.1.4 had issues with daemon threads and signal handling that could prevent a graceful exit or cause unexpected behavior in multi-threaded programs. 
Ensure you are on the latest version for improved stability.","severity":"gotcha","affected_versions":"<0.1.4"},{"fix":"Upgrade to `cloud-tpu-diagnostics>=0.1.1` to ensure robust stack trace collection for a wider range of exceptions.","message":"In versions before 0.1.1, exceptions such as `AssertionError` or `tensorflow.python.framework.errors_impl.NotFoundError` might not trigger stack trace dumping to the console when `collect_stack_trace=True` and `stack_trace_to_cloud=False`.","severity":"gotcha","affected_versions":"<0.1.1"},{"fix":"Ensure `pip install cloud-tpu-diagnostics` is run on all worker VMs (e.g., using `gcloud compute tpus tpu-vm ssh --worker=all ...`) and that the `diagnostic.diagnose()` context manager wraps the entry point of your application on all instances.","message":"The `cloud-tpu-diagnostics` package must be installed on *all* TPU VMs and the diagnostic code integrated into *all* scripts running on those VMs to ensure comprehensive and accurate diagnostics. Incomplete deployment can lead to missing or partial trace data.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Verify that you are running the correct and current JAX runtime version for your TPU VM. Check the Cloud TPU release notes and ensure your `gcloud` configuration specifies a compatible `--version` for your TPU type.","cause":"This error typically indicates that the Cloud TPU VM version is incorrect or incompatible with your JAX/TensorFlow runtime, or there's an underlying issue with the TPU hardware provisioning.","error":"RuntimeError: Unable to initialize backend 'tpu': UNAVAILABLE: No TPU Platform available."},{"fix":"Verify your Google Cloud project, zone, and TPU VM name. Check that your VPC network's firewall rules allow SSH (TCP:22) access. 
Ensure your SSH keys are correctly propagated or try running the `gcloud compute tpus tpu-vm ssh` command again.","cause":"This is a common network connectivity issue preventing SSH access to your TPU VM, often due to firewall rules, incorrect project/zone settings, or an SSH key propagation problem.","error":"ssh: connect to host X.X.X.X port 22: Connection timed out / ERROR: (gcloud.compute.tpus.tpu-vm.ssh) Could not SSH into the instance."},{"fix":"Ensure `stack_trace_config.collect_stack_trace` is set to `True`. If `stack_trace_to_cloud` is `True`, verify that the Cloud Logging agent is active and has the necessary permissions. Also, confirm that your application's `main` method is correctly wrapped by `diagnostic.diagnose()`. You can view logs in Logs Explorer using the query `logName=\"projects/<project_name>/logs/tpu.googleapis.com%2Fruntime_monitor\" jsonPayload.verb=\"stacktraceanalyzer\"`.","cause":"This can occur if `stack_trace_to_cloud` is `True` but the agent for uploading logs is not running, if `collect_stack_trace` is `False`, or if the diagnostic code is not properly integrated into your application.","error":"Missing or incomplete stack traces in Cloud Logging or `/tmp/debugging` directory."}]}