NVIDIA Collective Communication Library (NCCL) Runtime for CUDA 12

2.29.7 · active · verified Sat Mar 28

nvidia-nccl-cu12 (version 2.29.7) is the Python package that ships the NVIDIA Collective Communication Library (NCCL) runtime built for CUDA 12.x. NCCL is a foundational library of high-performance inter-GPU and inter-node communication primitives, such as all-reduce, all-gather, broadcast, and point-to-point send/receive, that are crucial for accelerating distributed deep learning workloads. It follows a rapid release cadence, often synchronized with CUDA toolkit and major deep learning framework updates.
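Since the package ships a shared library rather than an importable module, a quick way to confirm it is present is to look for the bundled libnccl in site-packages. The sketch below assumes the current wheel layout (nvidia/nccl/lib inside site-packages), which may change between releases:

```python
import glob
import os
import sysconfig

def find_bundled_nccl():
    """Locate the libnccl shared library installed by nvidia-nccl-cu12.

    The wheel places the library under nvidia/nccl/lib in site-packages
    (the path layout is an assumption based on current wheels).
    Returns the first match, or None if the package is not installed.
    """
    site_dir = sysconfig.get_paths()["purelib"]
    pattern = os.path.join(site_dir, "nvidia", "nccl", "lib", "libnccl.so*")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None

if __name__ == "__main__":
    path = find_bundled_nccl()
    print(path if path else "nvidia-nccl-cu12 does not appear to be installed")
```

Frameworks such as PyTorch locate and load this library themselves; the check above is only a diagnostic.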

Warnings

Install

pip install nvidia-nccl-cu12

Imports

The wheel ships shared libraries only (libnccl under nvidia/nccl/lib); there is no importable Python module named nccl. NCCL is typically consumed indirectly through frameworks such as PyTorch via torch.distributed.

Quickstart

This quickstart demonstrates how NCCL is typically used indirectly via PyTorch's `torch.distributed` module for multi-GPU collective communication, specifically an `all_reduce` operation; NCCL provides the underlying high-performance backend. A distributed launcher (e.g., `torchrun` or `mpirun`) is required to run this code across multiple processes/GPUs; the legacy `torch.distributed.launch` module is deprecated in recent PyTorch releases. For direct Python bindings with explicit NCCL API calls, consider a dedicated binding such as `nccl4py`.

import os
import torch
import torch.distributed as dist

# This quickstart assumes a multi-process setup, typically launched
# via torch.distributed.launch or mpirun, where each process
# runs this script with a unique rank and world_size.

# Example environment variables (set by launch utility):
# os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')
# os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')
# os.environ['RANK'] = os.environ.get('RANK', '0')
# os.environ['WORLD_SIZE'] = os.environ.get('WORLD_SIZE', '1')

def run_distributed_example(rank, world_size):
    # Initialize the process group with NCCL backend
    print(f"Initializing process group for rank {rank} (world size {world_size})...")
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    print(f"Process group initialized on rank {rank}.")

    # Bind this process to its GPU (assumes one process per GPU on a single
    # node; multi-node jobs should use the LOCAL_RANK environment variable
    # instead of the global rank)
    torch.cuda.set_device(rank)

    # Create a tensor on the GPU
    tensor = torch.ones(10, device=f'cuda:{rank}') * (rank + 1)
    print(f"Rank {rank}: Initial tensor value: {tensor}")

    # Perform an all_reduce operation (summing tensors across all GPUs)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    print(f"Rank {rank}: Tensor after all_reduce: {tensor}")

    # Clean up the process group
    dist.destroy_process_group()
    print(f"Rank {rank}: Process group destroyed.")

# To run this on a single node with 2 GPUs, you would typically use:
# torchrun --nproc_per_node=2 your_script.py
# Or set the environment variables manually and start each process yourself:
# MASTER_ADDR=localhost MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 python your_script.py
# MASTER_ADDR=localhost MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 python your_script.py

# For simplicity, if running as a single process for structural check:
if __name__ == '__main__':
    # In a real scenario, rank and world_size would be provided by a launcher.
    # This block is for structural demonstration only and will not perform
    # actual distributed communication without a proper launcher.
    try:
        rank = int(os.environ.get('RANK', '0'))
        world_size = int(os.environ.get('WORLD_SIZE', '1'))
        if torch.cuda.is_available() and world_size > 0:
             run_distributed_example(rank, world_size)
        else:
             print("CUDA not available or world_size is 0. Cannot run distributed example.")
    except RuntimeError as e:
        print(f"Error initializing distributed environment: {e}. This often happens when the script is not started by a distributed launcher such as torchrun.")
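To make the expected output of the quickstart concrete, here is a plain-Python sketch of the arithmetic that `ReduceOp.SUM` performs (no GPUs, NCCL, or PyTorch involved, just the elementwise sum across ranks):

```python
def simulate_all_reduce_sum(world_size, length=10):
    """Simulate the all_reduce(SUM) from the quickstart with plain lists.

    Each rank starts with a vector of ones scaled by (rank + 1); after the
    collective, every rank holds the same elementwise sum across all ranks.
    """
    tensors = [[float(rank + 1)] * length for rank in range(world_size)]
    reduced = [sum(values) for values in zip(*tensors)]
    return reduced

# With world_size=2: rank 0 holds [1.0]*10 and rank 1 holds [2.0]*10,
# so after all_reduce(SUM) every rank ends up with [3.0]*10.
print(simulate_all_reduce_sum(2))
```

This is why, in a two-GPU run of the script above, both ranks print a tensor of 3s after the collective.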
