NV One Logger PyTorch Lightning Integration

2.3.1 · active · verified Thu Apr 16

The `nv-one-logger-pytorch-lightning-integration` library (current version 2.3.1) provides wrappers that enable training-job telemetry for PyTorch Lightning applications. It plugs into PyTorch Lightning's callback mechanism and supplements it to support asynchronous checkpointing and application lifecycle events that Lightning's native callbacks do not fully cover. The library targets automatic telemetry collection and seamless integration with the NVIDIA One Logger training telemetry system, and is released on a regular cadence.

Install
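
The package can be installed with pip; the distribution name below is assumed to match the library name used in this page's title:

pip install nv-one-logger-pytorch-lightning-integration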

Quickstart

This quickstart demonstrates how to integrate `nv-one-logger` with a basic PyTorch Lightning training loop. It involves configuring the `TrainingTelemetryProvider` and then using `hook_trainer_cls` to wrap the standard `Trainer`. The `HookedTrainer` automatically adds the necessary callbacks for telemetry collection. For production, the `TrainingTelemetryProvider` would be configured with a specific exporter (e.g., OpenTelemetry, Weights & Biases) to send telemetry data to a backend.
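
Before the full example, the class-wrapping pattern that `hook_trainer_cls` follows can be sketched with stand-in classes. Everything below (`Trainer`, `Callback`, `TelemetryCallback`, the local `hook_trainer_cls`) is a hypothetical stand-in for illustration, not the real PyTorch Lightning or nv-one-logger API:

```python
class Callback:
    """Stand-in for a Lightning-style callback."""
    def on_train_start(self):
        pass

class Trainer:
    """Stand-in for a Lightning-style trainer that accepts callbacks."""
    def __init__(self, callbacks=None, **kwargs):
        self.callbacks = list(callbacks or [])

class TelemetryCallback(Callback):
    """Stand-in for the telemetry callback the integration injects."""
    def __init__(self, provider):
        self.provider = provider

def hook_trainer_cls(trainer_cls, provider):
    """Return a Trainer subclass whose instances always get the telemetry callback."""
    callback = TelemetryCallback(provider)

    class HookedTrainer(trainer_cls):
        def __init__(self, *args, callbacks=None, **kwargs):
            callbacks = list(callbacks or [])
            callbacks.append(callback)  # inject telemetry automatically
            super().__init__(*args, callbacks=callbacks, **kwargs)

    return HookedTrainer, callback

HookedTrainer, cb = hook_trainer_cls(Trainer, provider="demo-provider")
trainer = HookedTrainer()
print(any(c is cb for c in trainer.callbacks))  # → True: callback was auto-added
```

The key design point is that the returned class, not each instance, carries the injection logic, so user code constructs `HookedTrainer` exactly as it would the plain `Trainer`.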

import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset
from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls

# --- Dummy components for a runnable example ---
class DummyDataset(Dataset):
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        return torch.randn(10), torch.randint(0, 2, (1,)).squeeze()

class SimpleModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 2)
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.linear(x)
        loss = torch.nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# --- NV One Logger Integration ---
# 1. Configure the TrainingTelemetryProvider. The default configuration is
#    enough for this demo; in a real application you would attach a proper
#    exporter (e.g. .with_exporter(OTELHttpExporter(...)) for OpenTelemetry,
#    or a Weights & Biases exporter) so telemetry reaches a backend.
TrainingTelemetryProvider.instance().configure_provider()

# 2. Hook the PyTorch Lightning Trainer class
HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())

# 3. Instantiate your model and data loaders
model = SimpleModel()
train_dataset = DummyDataset()
train_dataloader = DataLoader(train_dataset, batch_size=4)

# 4. Use the HookedTrainer instance
# Pass it the same parameters you would pass to the regular Lightning Trainer.
# The nv_one_logger_callback is automatically added, no need to pass it explicitly.
trainer = HookedTrainer(
    max_epochs=1,
    limit_train_batches=2,  # limit batches for a quick demo run
    logger=False,  # disable default PTL loggers; add your own if needed
    accelerator='cpu'  # run on CPU so the example works anywhere
)

# 5. Train the model
trainer.fit(model, train_dataloader)

print("Training complete with NV One Logger integration.")
