NVIDIA Collective Communication Library (NCCL) Runtime
The `nvidia-nccl-cu13` package provides the NVIDIA Collective Communication Library (NCCL) runtime built for CUDA 13.x. NCCL is a library of standard routines for inter-GPU communication, optimized for NVIDIA GPUs. It is used primarily as a backend by deep learning frameworks such as PyTorch and TensorFlow for distributed training on multi-GPU systems. The package does not expose a direct Python API for end users; it ships the necessary shared libraries. Releases track NVIDIA CUDA Toolkit versions.
Warnings
- gotcha This package is a runtime dependency and does NOT expose a direct Python API. You typically won't `import nvidia_nccl` or `import nccl` in your Python code. Its functionality is leveraged internally by higher-level deep learning frameworks.
- breaking CUDA Version Mismatch: `nvidia-nccl-cu13` is compiled specifically for CUDA 13.x. Using it with a different CUDA version (e.g., CUDA 12.x or 11.x) installed on your system, or expected by your deep learning framework, can lead to runtime errors (e.g., `_nccl_create_comm` failures or symbol lookup errors).
- gotcha Conflicts with Framework-Bundled NCCL: Some deep learning frameworks (e.g., PyTorch, TensorFlow) might ship with their own pre-compiled NCCL libraries, or they might expect a specific version of NCCL installed globally. This can lead to conflicts if the `nvidia-nccl-cu13` package's version doesn't align with the framework's expectation.
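A quick way to spot the mismatches above is to ask PyTorch which CUDA and NCCL versions it was built against. The sketch below assumes PyTorch is installed; the `check_cuda_nccl_versions` helper name is ours, and `torch.cuda.nccl.version()` may raise on CPU-only builds, so everything is guarded.

```python
def check_cuda_nccl_versions():
    """Report the CUDA and NCCL versions PyTorch was built against.

    Returns a dict, or None when PyTorch is not installed. Values may be
    None on CPU-only builds. Hypothetical helper for illustration.
    """
    try:
        import torch
    except ImportError:
        return None
    info = {"torch_cuda": torch.version.cuda}  # e.g. "13.0"; None on CPU-only builds
    try:
        info["nccl"] = torch.cuda.nccl.version()  # e.g. (2, 27, 3)
    except Exception:
        info["nccl"] = None
    return info

print(check_cuda_nccl_versions())
```

If `torch_cuda` does not start with "13.", installing `nvidia-nccl-cu13` alongside that framework build is likely to cause the conflicts described above.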
Install
- pip install nvidia-nccl-cu13
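To confirm what the wheel actually installed, you can look for the bundled shared objects. The sketch below assumes the usual NVIDIA wheel layout (`site-packages/nvidia/nccl/lib/`), which may differ between versions; it returns an empty list when the package is absent.

```python
import importlib.util
from pathlib import Path

def find_libnccl():
    """Return paths to libnccl shared objects installed by nvidia-nccl-cu13.

    Assumes the common nvidia/nccl/lib wheel layout; returns [] when the
    package (or the nvidia namespace) is not installed.
    """
    try:
        spec = importlib.util.find_spec("nvidia.nccl")
    except ModuleNotFoundError:
        return []
    if spec is None or not spec.submodule_search_locations:
        return []
    root = Path(list(spec.submodule_search_locations)[0])
    return sorted(str(p) for p in root.glob("lib/libnccl*.so*"))

print(find_libnccl())
```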
Imports
- NCCL Runtime (indirect usage)
This package primarily provides shared library files (e.g., `libnccl.so`) that deep learning frameworks (like PyTorch or TensorFlow) link against. It does NOT expose a direct Python API for end-user import.
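To illustrate the "indirect usage" point, the shared library can still be queried from Python via `ctypes`, much as frameworks load it. This is a sketch, not an official interface: it assumes the usual NCCL 2.x soname `libnccl.so.2` is on the loader path, and uses `ncclGetVersion`, a real NCCL C API call. The integer encoding `major*10000 + minor*100 + patch` applies to NCCL >= 2.9.

```python
import ctypes

def nccl_runtime_version():
    """Load libnccl.so.2 and call ncclGetVersion().

    Returns (major, minor, patch), or None when the library cannot be
    loaded or the call fails. Hypothetical helper for illustration.
    """
    try:
        lib = ctypes.CDLL("libnccl.so.2")
    except OSError:
        return None
    code = ctypes.c_int()
    if lib.ncclGetVersion(ctypes.byref(code)) != 0:  # 0 == ncclSuccess
        return None
    v = code.value
    # NCCL >= 2.9 encodes the version as major*10000 + minor*100 + patch.
    return (v // 10000, (v % 10000) // 100, v % 100)

print(nccl_runtime_version())
```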
Quickstart
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')
    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    # Use a GPU if available, otherwise CPU (though NCCL requires GPUs)
    device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu')
    model = ToyModel().to(device)
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    for _ in range(3):
        inputs = torch.randn(20, 10).to(device)
        labels = torch.randn(20, 5).to(device)
        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if rank == 0:  # Only print from rank 0 to avoid flooding the logs
            print(f"Rank {rank}, Loss: {loss.item():.4f}")
    cleanup()

if __name__ == "__main__":
    # This example requires multiple processes to run; launch it with
    # torchrun (or python -m torch.distributed.run):
    #   torchrun --nproc_per_node=2 your_script.py
    # Note: the NCCL backend will fail outside a multi-GPU DDP setup,
    # so a single-process dry run (demo_basic(0, 1)) will likely error.
    print("This script demonstrates NCCL usage via PyTorch DDP.")
    print("To run, execute with `torchrun --nproc_per_node=<num_gpus> your_script.py`")
    print("e.g., `torchrun --nproc_per_node=2 quickstart.py`")