NVIDIA Collective Communication Library (NCCL) Runtime for CUDA 12
nvidia-nccl-cu12 (version 2.29.7) is the Python package that provides the NVIDIA Collective Communication Library (NCCL) runtime built for CUDA 12.x. NCCL is a foundational library of high-performance inter-GPU and inter-node communication primitives (all-reduce, all-gather, broadcast, reduce-scatter, and point-to-point send/receive) that are crucial for accelerating distributed deep learning workloads. It follows a rapid release cadence, often synchronized with CUDA toolkit and major deep learning framework updates.
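To make the all-reduce primitive named above concrete, here is a minimal pure-Python sketch of its semantics (no NCCL or GPUs involved): every rank contributes a vector, and every rank receives the element-wise sum of all contributions.

```python
def all_reduce_sum(rank_tensors):
    # Semantics of an all-reduce with a SUM reduction: each rank
    # contributes one tensor, and every rank receives the
    # element-wise sum of all contributions.
    total = [sum(values) for values in zip(*rank_tensors)]
    return [list(total) for _ in rank_tensors]

# Two "ranks", each holding a 3-element vector.
before = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
after = all_reduce_sum(before)
print(after)  # every rank now holds [3.0, 3.0, 3.0]
```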
Warnings
- breaking NCCL versions are tightly coupled with the CUDA Toolkit version and with the CUDA version used to compile deep learning frameworks (such as PyTorch or TensorFlow). Mismatches can lead to runtime errors, silent performance degradation, or unexpected behavior.
- gotcha The `nvidia-nccl-cu12` package itself primarily provides the `libnccl.so` shared library. Direct Python API calls are not exposed through this package. Instead, Python users interact with NCCL through higher-level libraries like `nccl4py` (official bindings) or as a backend to distributed training modules in frameworks like PyTorch (`torch.distributed`) or TensorFlow (`tf.distribute`).
- gotcha Conflicts can arise if multiple NCCL installations are present on the system (e.g., `nvidia-nccl-cu12` from PyPI, a system-wide `apt`/`dnf` installed NCCL, or one bundled with a deep learning framework). The dynamic loader's search path (`LD_LIBRARY_PATH`) determines which `libnccl.so` is loaded, potentially leading to an incorrect version being used.
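One quick way to diagnose the multiple-installation gotcha above is to ask the dynamic loader which `libnccl` it would resolve. A small sketch; it prints `None` when no NCCL is on the loader's search path.

```python
import os
import ctypes.util

def locate_nccl():
    # Ask the dynamic loader which libnccl it would resolve.
    # Returns e.g. 'libnccl.so.2' on Linux, or None if no NCCL
    # library is on the loader's search path.
    return ctypes.util.find_library("nccl")

print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
print("resolved NCCL:", locate_nccl())
```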
Install
- pip install nvidia-nccl-cu12
- pip install "nccl4py[cu12]"  # official Python bindings
Imports
- NcclCommunicator
from nccl.core import NcclCommunicator
- lib
from nccl.bindings import lib
- dist
import torch.distributed as dist
- tf.distribute.NcclAllReduce
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce())
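Because framework and runtime NCCL versions must line up (see Warnings), it helps to query the NCCL version PyTorch was built against. A hedged sketch: `torch.cuda.nccl.version()` returns a version tuple, and the helper degrades gracefully when torch or CUDA is unavailable.

```python
def torch_nccl_version():
    # Report the NCCL version PyTorch was compiled against,
    # degrading gracefully when torch or CUDA is unavailable.
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    return torch.cuda.nccl.version()  # a tuple such as (2, 29, 7)

print(torch_nccl_version())
```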
Quickstart
import os
import torch
import torch.distributed as dist
# This quickstart assumes a multi-process setup, typically launched
# via torch.distributed.launch or mpirun, where each process
# runs this script with a unique rank and world_size.
# Example environment variables (set by launch utility):
# os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')
# os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')
# os.environ['RANK'] = os.environ.get('RANK', '0')
# os.environ['WORLD_SIZE'] = os.environ.get('WORLD_SIZE', '1')
def run_distributed_example(rank, world_size):
    # Initialize the process group with the NCCL backend
    print(f"Initializing process group for rank {rank}/{world_size - 1}...")
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    print(f"Process group initialized on rank {rank}.")

    # Bind this process to one GPU (assumes rank == local GPU index)
    torch.cuda.set_device(rank)

    # Create a tensor on this process's GPU
    tensor = torch.ones(10, device=f'cuda:{rank}') * (rank + 1)
    print(f"Rank {rank}: Initial tensor value: {tensor}")

    # Perform an all_reduce (element-wise sum across all ranks)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {rank}: Tensor after all_reduce: {tensor}")

    # Clean up the process group
    dist.destroy_process_group()
    print(f"Rank {rank}: Process group destroyed.")

# To run this, use torchrun (torch.distributed.launch is deprecated):
#   torchrun --nproc_per_node=2 your_script.py
# Or set the environment variables manually, one process per rank:
#   MASTER_ADDR=localhost MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 python your_script.py
#   MASTER_ADDR=localhost MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 python your_script.py

if __name__ == '__main__':
    # In a real scenario, rank and world_size would be provided by a launcher.
    # Without one, this block only demonstrates structure; no actual
    # distributed communication takes place.
    try:
        rank = int(os.environ.get('RANK', '0'))
        world_size = int(os.environ.get('WORLD_SIZE', '1'))
        if torch.cuda.is_available() and world_size > 0:
            run_distributed_example(rank, world_size)
        else:
            print("CUDA not available or world_size is 0. Cannot run distributed example.")
    except RuntimeError as e:
        print(f"Error initializing distributed environment: {e}. "
              f"This usually means the script was not started by a distributed "
              f"launcher such as torchrun.")
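When initialization hangs or NCCL picks the wrong interconnect, NCCL's own logging is the first diagnostic. A launch sketch using the documented `NCCL_DEBUG` environment variables (`your_script.py` is a placeholder for the quickstart above):

```shell
# Turn on NCCL's diagnostic logging before launching.
export NCCL_DEBUG=INFO             # WARN is quieter; TRACE is very verbose
export NCCL_DEBUG_SUBSYS=INIT,NET  # optional: limit logging to subsystems

# torchrun spawns one process per GPU and sets RANK/WORLD_SIZE itself.
torchrun --nproc_per_node=2 your_script.py
```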