NVIDIA One-Logger Training Telemetry
The `nv-one-logger-training-telemetry` library provides tools for capturing and reporting training-job telemetry, integrating with the `one-logger` ecosystem. It enables standardized logging of metrics, hyperparameters, and system information for AI/ML training runs. The current version is 2.3.1; the library is part of the NVIDIA one-logger project and follows its release cadence.
Common errors
- `ImportError: cannot import name 'NVSessionClient' from 'nvtelemetry.client'`
  - cause: `NVSessionClient` was removed in `one-logger` v2.0.0 and replaced by `TelemetryClient`.
  - fix: Replace `from nvtelemetry.client import NVSessionClient` with `from nvtelemetry.client import TelemetryClient` and update its usage.
- `AttributeError: module 'nvtelemetry.config' has no attribute 'setup_environment_config'`
  - cause: `setup_environment_config` was removed in `one-logger` v2.3.0.
  - fix: Instantiate `TelemetryConfig` directly, or use `TelemetryConfig.from_env()` to load configuration, instead of calling the removed setup function.
- `Telemetry client failed to connect to backend: <specific connection error>`
  - cause: The telemetry client could not connect to the specified backend (e.g., MLflow, NeMo Service) due to misconfiguration or unavailability.
  - fix: Verify that the environment variables for the telemetry backend (e.g., `ONE_LOGGER_MLFLOW_TRACKING_URI`, `ONE_LOGGER_NEMO_SERVICE_URL`) are set correctly, and ensure the backend service is running and accessible.
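For the connection error above, failing fast before training starts is often easier to debug than a mid-run failure. A minimal preflight sketch using only the standard library; the variable names are the ones this document mentions, and `require_backend` is a hypothetical helper, not part of the `nvtelemetry` API:

```python
import os

# Backend environment variables mentioned in this document; at least one
# should be set for the client to send data to a remote backend.
BACKEND_VARS = ("ONE_LOGGER_MLFLOW_TRACKING_URI", "ONE_LOGGER_NEMO_SERVICE_URL")

def require_backend(env=os.environ):
    """Return the name of the first configured backend variable, or raise
    a RuntimeError listing the candidates. Hypothetical helper."""
    for var in BACKEND_VARS:
        if env.get(var):
            return var
    raise RuntimeError(
        "No telemetry backend configured; set one of: " + ", ".join(BACKEND_VARS)
    )
```

Calling this at startup turns a vague backend connection error into an actionable message about which variable is missing.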
Warnings
- breaking The `NVSessionClient` class and all `nvs_` prefixed functions were removed in `one-logger` v2.0.0 (the base project for `nvtelemetry`), replaced by `TelemetryClient`.
- breaking The direct utility functions `nvtelemetry.config.setup_environment_config` and `config.get_environment_config` were removed in `one-logger` v2.3.0. Configuration should now be managed directly via a `TelemetryConfig` instance.
- gotcha The `nvtelemetry` client requires a compatible telemetry backend (e.g., NVIDIA NeMo Service, ClearML, MLflow) to send data. Without proper configuration, logs might be collected locally but not sent remotely, or fail silently.
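The silent-failure gotcha above can be mitigated by adding a local fallback sink, so records that fail to reach the backend are kept on disk instead of lost. A minimal sketch of the pattern using only the standard library; `log_with_fallback` and its arguments are hypothetical and not part of the `nvtelemetry` API:

```python
import json

def log_with_fallback(record, send_remote, fallback_path="telemetry_fallback.jsonl"):
    """Try to send a telemetry record remotely; on failure, append it to a
    local JSONL file so the data is not silently lost. Returns "remote" or
    "local" to indicate where the record ended up. Hypothetical pattern."""
    try:
        send_remote(record)
        return "remote"
    except Exception:
        with open(fallback_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return "local"
```

The fallback file can later be replayed against the backend once connectivity is restored.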
Install
```bash
pip install nv-one-logger-training-telemetry
```
Imports
- TelemetryClient

```python
# Removed in one-logger v2.0.0 -- do not use:
# from nvtelemetry.client import NVSessionClient
from nvtelemetry.client import TelemetryClient
```

- TelemetryConfig

```python
from nvtelemetry.config import TelemetryConfig
```
Quickstart
```python
from datetime import datetime
import os

from nvtelemetry.client import TelemetryClient
from nvtelemetry.config import TelemetryConfig

# Example configuration (adjust as needed for actual usage).
# In a real environment, project and run_id might be set via environment
# variables or a more complex configuration management system.
project_name = os.environ.get("ONE_LOGGER_PROJECT_NAME", "my_ml_project_example")
run_id = os.environ.get("ONE_LOGGER_RUN_ID", f"run_{datetime.now().strftime('%Y%m%d%H%M%S')}")

config = TelemetryConfig(
    project=project_name,
    model="my_model_v1",
    run_id=run_id,
    framework="pytorch",
    framework_version="1.13.1",
    container_image="nvcr.io/nvidia/pytorch:23.05-py3",
    tags={
        "experiment": "initial_test",
        "dataset": "cifar10",
    },
    mlflow_tracking_uri=os.environ.get("ONE_LOGGER_MLFLOW_TRACKING_URI", ""),  # example for an MLflow backend
)

try:
    # Initialize the client. This connects to the configured backend (if any).
    with TelemetryClient(config=config) as client:
        print(f"Telemetry client initialized for project '{config.project}', run: {config.run_id}")

        # Log hyperparameters once at the start of the run.
        client.log_hyperparameters(learning_rate=0.001, batch_size=32, epochs=10)
        print("Logged hyperparameters.")

        # Log metrics over steps/epochs.
        for epoch in range(3):
            train_loss = 0.5 - epoch * 0.05
            val_loss = 0.6 - epoch * 0.08
            accuracy = 0.7 + epoch * 0.03
            client.log_metrics(step=epoch, train_loss=train_loss, val_loss=val_loss, accuracy=accuracy)
            print(f"Logged metrics for epoch {epoch}: train_loss={train_loss:.3f}, val_loss={val_loss:.3f}, accuracy={accuracy:.3f}")

        # Log an artifact path (this records the path only, not the artifact itself).
        client.log_artifact_path("model_checkpoint", "/path/to/my_model_checkpoint.pt")
        print("Logged artifact path.")

        # Log a final message.
        client.log_message("Training run completed successfully.")
        print("Logged completion message.")
except Exception as e:
    print(f"An error occurred during telemetry logging: {e}")
```

Note: `TelemetryClient` may require specific endpoint configuration or environment variables (e.g., `ONE_LOGGER_MLFLOW_TRACKING_URI`, `ONE_LOGGER_NEMO_SERVICE_URL`) to connect to a telemetry backend such as MLflow or NVIDIA NeMo Service. This example primarily demonstrates the API and may not send data to a remote service without proper setup.