NVIDIA One-Logger Training Telemetry

2.3.1 · active · verified Thu Apr 16

The `nv-one-logger-training-telemetry` library provides tools for capturing and reporting training-job telemetry, integrating with the `one-logger` ecosystem. It enables standardized logging of metrics, hyperparameters, and system information for AI/ML training runs. The current version is 2.3.1; the package is part of the NVIDIA one-logger project and follows its release cadence.

Install

The package should be installable from PyPI under the name given above:

    pip install nv-one-logger-training-telemetry

Imports

The quickstart below uses the client and configuration modules:

    from nvtelemetry.client import TelemetryClient
    from nvtelemetry.config import TelemetryConfig

Quickstart

This quickstart demonstrates how to initialize `TelemetryClient` with a `TelemetryConfig` and log common training data: hyperparameters, per-step metrics, and artifact paths. It reads the project name and run ID from environment variables where available, and wraps the client in a try/except so that connection errors to a backend service do not crash the run.

from nvtelemetry.client import TelemetryClient
from nvtelemetry.config import TelemetryConfig
from datetime import datetime
import os

# Example configuration (adjust as needed for actual usage)
# In a real environment, project and run_id might be set via environment variables
# or a more complex configuration management system.

# Using os.environ.get for dynamic values, falling back to defaults for example
project_name = os.environ.get("ONE_LOGGER_PROJECT_NAME", "my_ml_project_example")
run_id = os.environ.get("ONE_LOGGER_RUN_ID", f"run_{datetime.now().strftime('%Y%m%d%H%M%S')}")

config = TelemetryConfig(
    project=project_name,
    model="my_model_v1",
    run_id=run_id,
    framework="pytorch",
    framework_version="1.13.1",
    container_image="nvcr.io/nvidia/pytorch:23.05-py3",
    tags={
        "experiment": "initial_test",
        "dataset": "cifar10"
    },
    mlflow_tracking_uri=os.environ.get("ONE_LOGGER_MLFLOW_TRACKING_URI", "") # Example for MLflow backend
)

try:
    # Initialize the client. This will connect to the configured backend (if any).
    with TelemetryClient(config=config) as client:
        print(f"Telemetry client initialized for project '{config.project}', run: {config.run_id}")

        # Log hyperparameters
        client.log_hyperparameters(learning_rate=0.001, batch_size=32, epochs=10)
        print("Logged hyperparameters.")

        # Log metrics over steps/epochs
        for epoch in range(3):
            train_loss = 0.5 - epoch * 0.05
            val_loss = 0.6 - epoch * 0.08
            accuracy = 0.7 + epoch * 0.03
            client.log_metrics(step=epoch, train_loss=train_loss, val_loss=val_loss, accuracy=accuracy)
            print(f"Logged metrics for epoch {epoch}: train_loss={train_loss:.3f}, val_loss={val_loss:.3f}, accuracy={accuracy:.3f}")

        # Log an artifact path (this just records the path, not the artifact itself)
        client.log_artifact_path("model_checkpoint", "/path/to/my_model_checkpoint.pt")
        print("Logged artifact path.")

        # Log a final message
        client.log_message("Training run completed successfully.")
        print("Logged completion message.")

except Exception as e:
    print(f"An error occurred during telemetry logging: {e}")
    print(
        "Note: in a real environment, TelemetryClient may require backend "
        "configuration via environment variables (e.g., ONE_LOGGER_MLFLOW_TRACKING_URI, "
        "ONE_LOGGER_NEMO_SERVICE_URL) to connect to a telemetry backend such as MLflow "
        "or NVIDIA NeMo Service. This example primarily demonstrates the API usage and "
        "may not send data to a remote service without proper setup."
    )
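The environment-variable fallback used above for `run_id` is easy to factor into a small helper so it can be unit-tested without touching the real environment. The helper name below is illustrative, not part of the library; it uses only the standard library:

```python
import os
from datetime import datetime


def resolve_run_id(env=None):
    """Return ONE_LOGGER_RUN_ID if set, else a timestamp-based fallback run ID.

    `env` defaults to os.environ; passing a plain dict makes the
    fallback logic testable in isolation.
    """
    if env is None:
        env = os.environ
    # Same pattern as the quickstart: explicit value wins, otherwise
    # generate a run id like "run_20240416120000".
    return env.get("ONE_LOGGER_RUN_ID", f"run_{datetime.now().strftime('%Y%m%d%H%M%S')}")
```

Accepting the environment mapping as a parameter rather than reading `os.environ` directly keeps the configuration logic deterministic under test.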
