NV One Logger PyTorch Lightning Integration
The `nv-one-logger-pytorch-lightning-integration` library (current version 2.3.1) provides wrappers that enable training-job telemetry for PyTorch Lightning applications. It integrates with PyTorch Lightning's callback mechanism and supplements it to support asynchronous checkpointing and certain application lifecycle events not fully covered by Lightning's native callbacks. The library aims for automatic telemetry collection and seamless integration with the NVIDIA One Logger training telemetry system.
Common errors
-
OSError: [Errno 24] Too many open files: '/home/.../onelogger.err'
cause: A file descriptor leak within the underlying `nv-one-logger` components, particularly when logging extensively or over long training durations.
fix: Upgrade `nv-one-logger-pytorch-lightning-integration` and its foundational `nv-one-logger` packages to their latest versions. This specific error was noted to be resolved in related components around version 2.5.0.
-
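Until an upgrade is possible, one stopgap is to raise the process's file-descriptor limit. A minimal sketch, assuming a POSIX system (the `resource` module is Unix-only); this mitigates the symptom but does not fix the leak:

```python
import resource

# Query the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# An unprivileged process may raise its soft limit up to the hard limit.
if soft < hard and hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Run this early in your launcher script, before the training loop opens its log files.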
WARNING - Skipping execution of on_train_start because OneLogger is not enabled.
cause: The `nv-one-logger` system (specifically the `TrainingTelemetryProvider`) was not correctly initialized or configured before the PyTorch Lightning integration attempted to log an event.
fix: Before initializing your `HookedTrainer` or calling `hook_trainer_cls`, ensure you've properly configured and enabled the `TrainingTelemetryProvider` by calling `TrainingTelemetryProvider.instance().configure_provider()` with any necessary exporters. This step is crucial for activating the telemetry system.
-
PyTorch Lightning logger not showing per-step or per-epoch intermediate metrics (e.g., `test_accuracy` is an aggregate instead of values per epoch).
cause: This is common PyTorch Lightning logging behavior. By default, `self.log` often aggregates metrics at the epoch level, and `test_step`/`validation_step` frequently log only final aggregated values unless explicitly configured.
fix: When calling `self.log()` within your `LightningModule`, explicitly set `on_step=True` and/or `on_epoch=True` to control the logging frequency. Be aware that for `test_step` and `validation_step`, metrics are often aggregated by default across the entire run or epoch. Adjust your logging strategy or post-processing if you need per-step visibility in these phases.
Warnings
- gotcha Not all training events are implicitly captured by the PyTorch Lightning integration. Some specific application lifecycle events (e.g., `on_model_init_start`, `on_dataloader_init_start`) require explicit calls to the corresponding `TimeEventCallback.on_xxx` methods for telemetry collection.
- gotcha During multi-GPU or distributed training, you may encounter numerous warnings like 'Skipping execution of on_train_start because OneLogger is not enabled.' This typically indicates that the `OneLogger` system was not properly initialized or enabled across all processes.
- gotcha Users might encounter an `OSError: [Errno 24] Too many open files` during long training runs when using `nv-one-logger` integrations, specifically referencing `onelogger.err` or `onelogger.log` files.
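To check whether your own run is leaking descriptors, you can sample the process's open-FD count over time. A minimal diagnostic sketch, assuming Linux (it reads `/proc/self/fd`):

```python
import os

def open_fd_count() -> int:
    # Each entry in /proc/self/fd is one file descriptor open in this process.
    return len(os.listdir("/proc/self/fd"))

baseline = open_fd_count()
f = open(os.devnull)  # opening a file raises the count by one
grew = open_fd_count() - baseline
f.close()
print(f"baseline={baseline}, grew_by={grew}")
```

Sampling this periodically during training (e.g., once per validation epoch) makes a handle leak on files like `onelogger.err` visible as monotonic growth.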
Install
-
pip install nv-one-logger-pytorch-lightning-integration
Imports
- hook_trainer_cls
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls
- TrainingTelemetryProvider
from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
Quickstart
import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset

from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls

# --- Dummy components for a runnable example ---
class DummyDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(10), torch.randint(0, 2, (1,)).squeeze()

class SimpleModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.linear(x)
        loss = torch.nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# --- NV One Logger Integration ---
# 1. Configure the TrainingTelemetryProvider before hooking the Trainer.
#    In a real application you would attach a proper exporter (e.g. via
#    .with_exporter(...)) before calling configure_provider(); a bare
#    configuration is used here only for demonstration.
TrainingTelemetryProvider.instance().configure_provider()

# 2. Hook the PyTorch Lightning Trainer class.
HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())

# 3. Instantiate your model and data loader.
model = SimpleModel()
train_dataset = DummyDataset()
train_dataloader = DataLoader(train_dataset, batch_size=4)

# 4. Use HookedTrainer with the same parameters as the regular Lightning Trainer.
#    The nv_one_logger_callback is added automatically; do not pass it explicitly.
trainer = HookedTrainer(
    max_epochs=1,
    limit_train_batches=2,  # limit batches for a quick run
    logger=False,           # disable default PTL loggers if not needed
    accelerator='cpu',      # run on CPU for general demonstration
)

# 5. Train the model.
trainer.fit(model, train_dataloader)
print("Training complete with NV One Logger integration.")