{"id":8362,"library":"nv-one-logger-pytorch-lightning-integration","title":"NV One Logger PyTorch Lightning Integration","description":"The `nv-one-logger-pytorch-lightning-integration` library (current version 2.3.1) provides wrappers that enable training-job telemetry for PyTorch Lightning applications. It builds on PyTorch Lightning's callback mechanism, supplementing it to support asynchronous checkpointing and application lifecycle events that Lightning's native callbacks do not fully cover. Telemetry is collected automatically and forwarded to the NVIDIA One Logger training telemetry system.","status":"active","version":"2.3.1","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/nv-one-logger","tags":["pytorch-lightning","logging","telemetry","nvidia","mlops","gpu","training"],"install":[{"cmd":"pip install nv-one-logger-pytorch-lightning-integration","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core framework for integration.","package":"pytorch-lightning","optional":false},{"reason":"Underlying telemetry library, implicitly required.","package":"nv-one-logger-training-telemetry","optional":false},{"reason":"Required by PyTorch Lightning.","package":"torch","optional":false}],"imports":[{"note":"Used to wrap the PyTorch Lightning Trainer for telemetry.","symbol":"hook_trainer_cls","correct":"from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls"},{"note":"Required to configure the One Logger telemetry system.","symbol":"TrainingTelemetryProvider","correct":"from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider"}],"quickstart":{"code":"import torch\nfrom pytorch_lightning import LightningModule, Trainer\nfrom torch.utils.data import DataLoader, Dataset\nfrom 
nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider\nfrom nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls\n\n# --- Dummy components for a runnable example ---\nclass DummyDataset(Dataset):\n    def __len__(self):\n        return 64\n    def __getitem__(self, idx):\n        return torch.randn(10), torch.randint(0, 2, (1,)).squeeze()\n\nclass SimpleModel(LightningModule):\n    def __init__(self):\n        super().__init__()\n        self.linear = torch.nn.Linear(10, 2)\n    def training_step(self, batch, batch_idx):\n        x, y = batch\n        y_hat = self.linear(x)\n        loss = torch.nn.functional.cross_entropy(y_hat, y)\n        self.log('train_loss', loss)\n        return loss\n    def configure_optimizers(self):\n        return torch.optim.Adam(self.parameters(), lr=0.02)\n\n# --- NV One Logger Integration ---\n# 1. Configure the TrainingTelemetryProvider. This demo relies on the default\n# configuration; in production, attach a concrete exporter (e.g. OpenTelemetry,\n# Weights & Biases) before calling configure_provider(), for example via\n# .with_exporter(...).\nTrainingTelemetryProvider.instance().configure_provider()\n\n# 2. Hook the PyTorch Lightning Trainer class\nHookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())\n\n# 3. Instantiate your model and data loaders\nmodel = SimpleModel()\ntrain_dataset = DummyDataset()\ntrain_dataloader = DataLoader(train_dataset, batch_size=4)\n\n# 4. 
Use the HookedTrainer instance\n# Pass it the same parameters you would pass to the regular Lightning Trainer.\n# The nv_one_logger_callback is added automatically; there is no need to pass it explicitly.\ntrainer = HookedTrainer(\n    max_epochs=1,\n    limit_train_batches=2, # Limit batches for a quick run\n    logger=False, # Disable default PTL loggers if not needed, or add others\n    accelerator='cpu' # Ensure it runs on CPU for general demonstration\n)\n\n# 5. Train the model\ntrainer.fit(model, train_dataloader)\n\nprint(\"Training complete with NV One Logger integration.\")\n","lang":"python","description":"This quickstart demonstrates how to integrate `nv-one-logger` with a basic PyTorch Lightning training loop: configure the `TrainingTelemetryProvider`, then use `hook_trainer_cls` to wrap the standard `Trainer`. The resulting `HookedTrainer` automatically adds the callbacks needed for telemetry collection. In production, the `TrainingTelemetryProvider` would be configured with a specific exporter (e.g., OpenTelemetry, Weights & Biases) to send telemetry data to a backend."},"warnings":[{"fix":"Consult the `nv-one-logger` documentation for the list of implicit vs. explicit telemetry calls, and add explicit calls where needed for the desired granularity.","message":"Not all training events are implicitly captured by the PyTorch Lightning integration. Some application lifecycle events (e.g., `on_model_init_start`, `on_dataloader_init_start`) require explicit calls to the corresponding `TimeEventCallback.on_xxx` methods for telemetry collection.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure `TrainingTelemetryProvider.instance().configure_provider()` is called globally, typically at the very start of your script or within a setup function that runs on all processes, before any `Trainer` or `hook_trainer_cls` initialization. 
Verify that all necessary environment variables or configuration files for `OneLogger` are accessible to all ranks.","message":"During multi-GPU or distributed training, you may encounter numerous warnings like 'Skipping execution of on_train_start because OneLogger is not enabled.' This typically indicates that the `OneLogger` system was not properly initialized or enabled across all processes.","severity":"gotcha","affected_versions":"All versions"},{"fix":"This issue was reported and resolved in related `OneLogger` components (e.g., `OneLoggerNeMoCallback` in NeMo version 2.5.0). Ensure you are using the latest stable versions of `nv-one-logger-pytorch-lightning-integration` and its core `nv-one-logger` dependencies to benefit from file handle management fixes.","message":"Users might encounter an `OSError: [Errno 24] Too many open files` during long training runs when using `nv-one-logger` integrations, specifically referencing `onelogger.err` or `onelogger.log` files.","severity":"gotcha","affected_versions":"< 2.5.0 (of related `OneLogger` components)"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Upgrade `nv-one-logger-pytorch-lightning-integration` and its foundational `nv-one-logger` packages to their latest versions. This specific error was noted to be resolved in related components around version 2.5.0.","cause":"A file descriptor leak within the underlying `nv-one-logger` components, particularly when logging extensively or over long training durations.","error":"OSError: [Errno 24] Too many open files: '/home/.../onelogger.err'"},{"fix":"Before initializing your `HookedTrainer` or calling `hook_trainer_cls`, ensure you've properly configured and enabled the `TrainingTelemetryProvider` by calling `TrainingTelemetryProvider.instance().configure_provider()` with any necessary exporters. 
This step is crucial for activating the telemetry system.","cause":"The `nv-one-logger` system (specifically the `TrainingTelemetryProvider`) was not correctly initialized or configured before the PyTorch Lightning integration attempted to log an event.","error":"WARNING - Skipping execution of on_train_start because OneLogger is not enabled."},{"fix":"When calling `self.log()` within your `LightningModule`, explicitly set `on_step=True` and/or `on_epoch=True` to control the logging frequency. Be aware that for `test_step` and `validation_step`, metrics are often aggregated by default across the entire run or epoch; adjust your logging strategy or post-processing if you need per-step visibility in these phases.","cause":"This is standard PyTorch Lightning logging behavior: by default, `self.log` often aggregates metrics at the epoch level, and `test_step`/`validation_step` log only final aggregated values unless explicitly configured otherwise.","error":"PyTorch Lightning logger not showing per-step or per-epoch intermediate metrics (e.g., `test_accuracy` is an aggregate instead of values per epoch)."}]}