Cloud TPU Diagnostics

0.1.5 · active · verified Thu Apr 16

The `cloud-tpu-diagnostics` library provides tools to monitor, debug, and profile jobs running on Cloud TPUs. It captures Python stack traces upon faults (e.g., segmentation faults, floating-point exceptions) and periodically collects traces to diagnose unresponsive or hung programs. Currently at version 0.1.5, it maintains an active release cadence with frequent updates focused on stability and minor feature enhancements.

Quickstart

To use the diagnostics, install the package on every Cloud TPU VM and wrap your main application logic in the `diagnostic.diagnose()` context manager. This example collects stack traces every 5 minutes and uploads them to Cloud Logging. Adjust `collect_stack_trace` and `stack_trace_to_cloud` to suit your debugging needs.

from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

def run_main_application():
    print("Running main application logic...")
    # Simulate an error or a long-running process
    # raise ValueError("Simulated error for diagnostics")
    print("Main application logic finished.")

stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,  # Set to False to dump to console
    stack_trace_interval_seconds=300,  # Collect every 5 minutes
)
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config
)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config
)

# Wrap your main method with diagnose() to periodically collect stack traces
with diagnostic.diagnose(diagnostic_config):
    run_main_application()

print("Diagnostics agent has stopped.")
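
The overview says the library captures stack traces on faults and at regular intervals. That general technique can be tried without a TPU using Python's standard-library `faulthandler` module. The following is a minimal sketch of the idea, not the library's implementation; `collect_periodic_traces` and `slow_work` are hypothetical names introduced here for illustration.

```python
import faulthandler
import tempfile
import time


def collect_periodic_traces(work, interval_seconds, out_file):
    """Run work() while dumping thread stack traces to out_file.

    Hypothetical helper: out_file must be a real file object (it needs a
    file descriptor), because faulthandler writes to it at the OS level.
    """
    # Dump Python tracebacks if a fatal signal such as SIGSEGV or SIGFPE arrives.
    faulthandler.enable(file=out_file, all_threads=True)
    # Also dump the tracebacks of all threads every interval_seconds.
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=out_file)
    try:
        return work()
    finally:
        # Stop the periodic dumps and the fault handler before returning.
        faulthandler.cancel_dump_traceback_later()
        faulthandler.disable()


def slow_work():
    """Stand-in for a job that appears to hang."""
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as trace_file:
        collect_periodic_traces(slow_work, 0.1, trace_file)
        trace_file.seek(0)
        print(trace_file.read())
```

Each periodic dump shows where every thread is currently executing, which is the information needed to tell a slow job from a hung one.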
