{"id":8363,"library":"nv-one-logger-training-telemetry","title":"NVIDIA One-Logger Training Telemetry","description":"The `nv-one-logger-training-telemetry` library provides tools for capturing and reporting training job telemetry data, integrating with the `one-logger` ecosystem. It enables standardized logging of metrics, hyperparameters, and system information for AI/ML training runs. The current version is 2.3.1 and it is part of the NVIDIA one-logger project, following its release cadence.","status":"active","version":"2.3.1","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/one-logger/tree/main/telemetry","tags":["nvidia","telemetry","logging","machine-learning","ai","training","metrics","hyperparameters","research"],"install":[{"cmd":"pip install nv-one-logger-training-telemetry","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core dependency for logging infrastructure and backend integration.","package":"one-logger"},{"reason":"Used for data validation and settings management, specifically for TelemetryConfig.","package":"pydantic","optional":false},{"reason":"Used for adding context tags to telemetry events.","package":"nv-context-tagger","optional":false},{"reason":"Used for data serialization of telemetry messages.","package":"protobuf","optional":false}],"imports":[{"note":"`NVSessionClient` was removed in `one-logger` v2.0.0 (the base project for `nvtelemetry`). 
`TelemetryClient` is the current main entry point.","wrong":"from nvtelemetry.client import NVSessionClient","symbol":"TelemetryClient","correct":"from nvtelemetry.client import TelemetryClient"},{"symbol":"TelemetryConfig","correct":"from nvtelemetry.config import TelemetryConfig"}],"quickstart":{"code":"from nvtelemetry.client import TelemetryClient\nfrom nvtelemetry.config import TelemetryConfig\nfrom datetime import datetime\nimport os\n\n# Example configuration (adjust as needed for actual usage)\n# In a real environment, project and run_id might be set via environment variables\n# or a more complex configuration management system.\n\n# Using os.environ.get for dynamic values, falling back to defaults for example\nproject_name = os.environ.get(\"ONE_LOGGER_PROJECT_NAME\", \"my_ml_project_example\")\nrun_id = os.environ.get(\"ONE_LOGGER_RUN_ID\", f\"run_{datetime.now().strftime('%Y%m%d%H%M%S')}\")\n\nconfig = TelemetryConfig(\n    project=project_name,\n    model=\"my_model_v1\",\n    run_id=run_id,\n    framework=\"pytorch\",\n    framework_version=\"1.13.1\",\n    container_image=\"nvcr.io/nvidia/pytorch:23.05-py3\",\n    tags={\n        \"experiment\": \"initial_test\",\n        \"dataset\": \"cifar10\"\n    },\n    mlflow_tracking_uri=os.environ.get(\"ONE_LOGGER_MLFLOW_TRACKING_URI\", \"\") # Example for MLflow backend\n)\n\ntry:\n    # Initialize the client. 
This will connect to the configured backend (if any).\n    with TelemetryClient(config=config) as client:\n        print(f\"Telemetry client initialized for project '{config.project}', run: {config.run_id}\")\n\n        # Log hyperparameters\n        client.log_hyperparameters(learning_rate=0.001, batch_size=32, epochs=10)\n        print(\"Logged hyperparameters.\")\n\n        # Log metrics over steps/epochs\n        for epoch in range(3):\n            train_loss = 0.5 - epoch * 0.05\n            val_loss = 0.6 - epoch * 0.08\n            accuracy = 0.7 + epoch * 0.03\n            client.log_metrics(step=epoch, train_loss=train_loss, val_loss=val_loss, accuracy=accuracy)\n            print(f\"Logged metrics for epoch {epoch}: train_loss={train_loss:.3f}, val_loss={val_loss:.3f}, accuracy={accuracy:.3f}\")\n\n        # Log an artifact path (this just records the path, not the artifact itself)\n        client.log_artifact_path(\"model_checkpoint\", \"/path/to/my_model_checkpoint.pt\")\n        print(\"Logged artifact path.\")\n\n        # Log a final message\n        client.log_message(\"Training run completed successfully.\")\n        print(\"Logged completion message.\")\n\nexcept Exception as e:\n    print(f\"An error occurred during telemetry logging: {e}\")\n    print(\"Note: In a real environment, TelemetryClient might require specific endpoint configuration or environment variables (e.g., ONE_LOGGER_MLFLOW_TRACKING_URI, ONE_LOGGER_NEMO_SERVICE_URL) to connect to a telemetry backend like MLflow or NVIDIA NeMo Service. This example primarily demonstrates the API usage, and may not send data to a remote service without proper setup.\")\n","lang":"python","description":"This quickstart demonstrates how to initialize `TelemetryClient` with a `TelemetryConfig` and log common training data such as hyperparameters, metrics over steps, and artifact paths. 
It reads the project name, run ID, and MLflow tracking URI from environment variables (with fallback defaults) and catches exceptions raised during logging."},"warnings":[{"fix":"Migrate from `NVSessionClient` to `TelemetryClient`. The new API uses `TelemetryClient` for initialization and `client.log_...` methods.","message":"The `NVSessionClient` class and all `nvs_`-prefixed functions were removed in `one-logger` v2.0.0 (the base project for `nvtelemetry`), replaced by `TelemetryClient`.","severity":"breaking","affected_versions":">=2.0.0"},{"fix":"Manually construct `TelemetryConfig` instances or load configuration from environment variables/files using `TelemetryConfig.from_env()` or similar patterns, rather than relying on the removed setup functions.","message":"The utility functions `nvtelemetry.config.setup_environment_config` and `config.get_environment_config` were removed in `one-logger` v2.3.0. Configuration should now be managed directly through a `TelemetryConfig` instance.","severity":"breaking","affected_versions":">=2.3.0"},{"fix":"Ensure environment variables like `ONE_LOGGER_PROJECT_NAME`, `ONE_LOGGER_MLFLOW_TRACKING_URI`, `ONE_LOGGER_NEMO_SERVICE_URL` are correctly set, or pass appropriate configurations directly to `TelemetryConfig` for your target backend.","message":"The `nvtelemetry` client requires a compatible telemetry backend (e.g., NVIDIA NeMo Service, ClearML, MLflow) to send data. 
Without proper configuration, logs might be collected locally but not sent remotely, or fail silently.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Replace `from nvtelemetry.client import NVSessionClient` with `from nvtelemetry.client import TelemetryClient` and update its usage.","cause":"`NVSessionClient` was removed in `one-logger` v2.0.0, replaced by `TelemetryClient`.","error":"ImportError: cannot import name 'NVSessionClient' from 'nvtelemetry.client'"},{"fix":"Directly instantiate `TelemetryConfig` or use `TelemetryConfig.from_env()` to load configuration, instead of calling the removed setup function.","cause":"`setup_environment_config` was removed in `one-logger` v2.3.0.","error":"AttributeError: module 'nvtelemetry.config' has no attribute 'setup_environment_config'"},{"fix":"Verify the environment variables for the telemetry backend (e.g., `ONE_LOGGER_MLFLOW_TRACKING_URI`, `ONE_LOGGER_NEMO_SERVICE_URL`) are correctly set, or ensure the backend service is running and accessible.","cause":"The telemetry client could not connect to the specified backend (e.g., MLflow, NeMo Service) due to misconfiguration or unavailability.","error":"Telemetry client failed to connect to backend: <specific connection error>"}]}