NVIDIA Collective Communication Library (NCCL) Runtime

2.29.7 · active · verified Thu Apr 09

The `nvidia-nccl-cu13` package provides the NVIDIA Collective Communication Library (NCCL) runtime for CUDA 13.x. NCCL is a library of standard routines for inter-GPU communication (all-reduce, all-gather, broadcast, and similar collectives), optimized for NVIDIA GPUs. It is used primarily as a backend by deep learning frameworks such as PyTorch and TensorFlow for distributed training on multi-GPU systems. The package does not expose a Python API to end users; it ships the shared libraries (e.g. `libnccl.so`) that frameworks load at runtime. Releases are versioned in step with the NVIDIA CUDA Toolkit.
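Because the package only ships shared libraries, one way to confirm which NCCL build is actually in use is to call the C API `ncclGetVersion` through `ctypes`. The sketch below assumes the library is loadable under the soname `libnccl.so.2`; the version-decoding scheme (`major*10000 + minor*100 + patch` for NCCL ≥ 2.9, `major*1000 + minor*100 + patch` before that) follows NCCL's `NCCL_VERSION` macro.

```python
import ctypes

def decode_nccl_version(code):
    """Decode ncclGetVersion()'s integer.

    NCCL >= 2.9 encodes versions as major*10000 + minor*100 + patch;
    older 2.x releases used major*1000 + minor*100 + patch.
    """
    if code >= 20900:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100

def nccl_runtime_version():
    """Load the runtime library and query its version, if present.

    The soname 'libnccl.so.2' is an assumption about how this package
    lays out its libraries; returns None when the library is not found.
    """
    try:
        lib = ctypes.CDLL("libnccl.so.2")
    except OSError:
        return None  # library not on the dynamic loader path
    version = ctypes.c_int(0)
    if lib.ncclGetVersion(ctypes.byref(version)) != 0:
        return None
    return decode_nccl_version(version.value)

print(decode_nccl_version(22907))  # the 2.29.7 release documented here -> (2, 29, 7)
```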

Warnings

Install
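A typical installation is via pip; note that CUDA-enabled PyTorch wheels often pull this package in automatically as a dependency. The pinned-version line assumes the release documented here is published on PyPI.

```shell
# Install the NCCL runtime for CUDA 13.x from PyPI.
pip install nvidia-nccl-cu13

# Or pin the version documented on this page (assumed to be on PyPI):
pip install "nvidia-nccl-cu13==2.29.7"
```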

Imports

Quickstart

This quickstart demonstrates how NCCL is used implicitly by PyTorch for distributed data parallel (DDP) training across multiple GPUs. The `nvidia-nccl-cu13` package provides the underlying `libnccl.so` library that `torch.distributed` loads when `dist.init_process_group` is called with the `"nccl"` backend. The code below sets up a minimal DDP training loop; launch it with `torchrun` (bundled with PyTorch) so that one process is started per GPU.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')
    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)  # bind this process to its own GPU

def cleanup():
    dist.destroy_process_group()

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # NCCL requires a GPU per process; the CPU fallback below only keeps
    # the script importable on machines without CUDA.
    device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu')
    model = ToyModel().to(device)
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

    for _ in range(3):
        inputs = torch.randn(20, 10).to(device)
        labels = torch.randn(20, 5).to(device)
        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if rank == 0:  # Only print from rank 0 to avoid flooding the output
            print(f"Rank {rank}, Loss: {loss.item():.4f}")

    cleanup()

if __name__ == "__main__":
    # This example requires multiple processes to run.
    # You would typically run this using torch.distributed.launch or torchrun:
    # python -m torch.distributed.run --nproc_per_node=2 your_script.py
    # For a single-process 'dry run' for syntax:
    # Note: NCCL backend will fail if not run in a multi-GPU DDP setup.
    # world_size = 1 # For dry-run, will likely fail with NCCL backend
    # rank = 0
    # demo_basic(rank, world_size)
    print("This script demonstrates NCCL usage via PyTorch DDP.")
    print("To run, execute with `torchrun --nproc_per_node=<num_gpus> your_script.py`")
    print("e.g., `torchrun --nproc_per_node=2 quickstart.py`")
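When the script is launched with `torchrun`, each worker process receives its rendezvous configuration through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`). A small sketch of reading them, with single-process defaults so a bare `python script.py` still runs; the helper name is illustrative, not part of any API.

```python
import os

def read_torchrun_env():
    """Collect the variables torchrun exports for each worker process.

    The defaults make a plain `python script.py` behave like a
    1-process, rank-0 run.
    """
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": os.environ.get("MASTER_PORT", "29500"),
    }

env = read_torchrun_env()
print(f"rank {env['rank']} of {env['world_size']}")
```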
