NVIDIA Collective Communication Library (NCCL) Runtime for CUDA 12

Version 2.29.7, verified Tue May 12

nvidia-nccl-cu12 (version 2.29.7) is the Python package providing the NVIDIA Collective Communication Library (NCCL) runtime built for CUDA 12.x. NCCL is a foundational library of high-performance inter-GPU and inter-node communication primitives (all-reduce, all-gather, reduce-scatter, broadcast, and point-to-point send/receive) that are crucial for accelerating distributed deep learning workloads. It has a rapid release cadence, often synchronized with CUDA Toolkit and major deep learning framework updates.
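The collective that appears most often in training loops, all-reduce, can be illustrated in plain Python (no GPUs, hypothetical toy data) to show what it computes: every rank contributes a tensor and every rank receives the elementwise reduction.

```python
# Conceptual sketch of all_reduce(SUM): each "rank" holds one vector;
# afterwards, every rank holds the elementwise sum of all vectors.
ranks = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}

total = [sum(vals) for vals in zip(*ranks.values())]
result = {r: total for r in ranks}  # every rank ends up with the same sum

print(result[0])  # [111.0, 222.0]
```

NCCL performs the same reduction, but over GPU memory and across nodes, using ring or tree algorithms chosen at runtime.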

pip install nvidia-nccl-cu12
error RuntimeError: NCCL Error 1: unhandled cuda error
cause This generic error usually points to an underlying problem with the CUDA runtime or the GPU itself, such as out-of-memory conditions, incompatible CUDA/driver/framework versions, or a transient hardware fault during distributed training.
fix
Ensure your CUDA toolkit, GPU drivers, and deep learning framework (e.g., PyTorch) versions are compatible. Monitor GPU memory usage for out-of-memory issues. Rerun your application with NCCL_DEBUG=INFO or NCCL_DEBUG=WARN environment variables to get more detailed logs that can pinpoint the specific CUDA error.
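For example, a minimal sketch of turning on NCCL's logging before relaunching (the training command is a placeholder for your own script):

```shell
# Enable verbose NCCL logging (levels: WARN, INFO, TRACE).
export NCCL_DEBUG=INFO
# Optionally write logs to per-host (%h), per-PID (%p) files instead of stderr.
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log
echo "NCCL_DEBUG=$NCCL_DEBUG"
# python train.py   # placeholder for your actual training command
```

The INFO-level log prints the resolved NCCL version, the transports selected (SHM, P2P, IB, socket), and the exact CUDA call that failed, which usually narrows the cause quickly.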
error RuntimeError: NCCL Error 2: unhandled system error
cause This error typically arises from misconfigurations related to system resources, particularly Linux shared memory (e.g., `/dev/shm`) used for inter-process communication, or issues with InfiniBand network setup.
fix
Verify that /dev/shm is mounted and has sufficient space, especially in containerized environments (Docker defaults to a 64 MB segment; enlarge it with `--shm-size`). As a diagnostic workaround, set the environment variable NCCL_SHM_DISABLE=1 to stop NCCL from using shared memory, at some performance cost. For InfiniBand, check network connectivity and drivers.
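A quick sketch of diagnosing the shared-memory side (the flags shown are standard NCCL and Docker options):

```shell
# Inspect the shared-memory mount NCCL uses for intra-node transport.
df -h /dev/shm || true
# Workaround: disable the shared-memory transport entirely (costs performance).
export NCCL_SHM_DISABLE=1
# In Docker, prefer enlarging the segment instead:
#   docker run --gpus all --shm-size=1g ...
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE"
```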
error ERROR: Could not find a version that satisfies the requirement nvidia-nccl-cu12==X.Y.Z (from versions: ...)
cause This installation error occurs when the specified version of `nvidia-nccl-cu12` is not available for your current Python version, operating system, or architecture on PyPI, or if there's a conflict with another library's NCCL dependency.
fix
Check the available versions on PyPI (pypi.org/project/nvidia-nccl-cu12/#files) for your specific environment (Python version, OS, architecture). Ensure your deep learning framework's CUDA version is compatible with the nvidia-nccl-cu12 package you are trying to install. If using a framework like PyTorch or TensorFlow, sometimes they manage NCCL internally, and explicit installation of nvidia-nccl-cu12 might not be necessary or can cause conflicts.
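Before pinning a version in requirements, it can help to check which build (if any) is already present in the environment. A minimal stdlib-only sketch:

```python
# Hypothetical check: report the nvidia-nccl-cu12 version installed in this
# environment, or note its absence, without importing any GPU libraries.
from importlib.metadata import version, PackageNotFoundError

try:
    installed = version("nvidia-nccl-cu12")
    print("installed nvidia-nccl-cu12:", installed)
except PackageNotFoundError:
    installed = None
    print("nvidia-nccl-cu12 is not installed here")
```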
error RuntimeError: Distributed package doesn't have NCCL built in
cause This PyTorch-specific error indicates that your PyTorch installation was not compiled with NCCL support, or the NCCL library cannot be found or loaded by PyTorch at runtime.
fix
Reinstall PyTorch making sure to specify a CUDA-enabled version, for example: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121. Ensure that the nvidia-nccl-cu12 package or the system-wide NCCL library is correctly installed and accessible in your environment's LD_LIBRARY_PATH.
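A quick sanity check for this error is asking PyTorch itself whether its distributed package exposes the NCCL backend. A guarded sketch (it also runs where PyTorch is absent):

```python
# Hypothetical diagnosis: does this PyTorch build expose the NCCL backend?
try:
    import torch.distributed as dist
    # Short-circuit so is_nccl_available() is only called on distributed builds.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    print("NCCL backend available:", nccl_ok)
except ImportError:
    nccl_ok = False
    print("PyTorch is not installed in this environment")
```

If this prints `False` on a CUDA machine, the installed wheel is the culprit, not your script.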
breaking NCCL versions are tightly coupled with CUDA Toolkit versions and the CUDA version used to compile deep learning frameworks (like PyTorch or TensorFlow). Mismatches can lead to runtime errors, silent performance degradation, or unexpected behavior.
fix Ensure that the `nvidia-nccl-cu12` package, your system's CUDA Toolkit, and the CUDA version used by your deep learning framework are all compatible. Consult the NVIDIA documentation or framework-specific guides for compatibility matrices. For PyTorch, `torch.cuda.is_available()` and `torch.version.cuda` can help verify. For `nccl4py`, use `pip install "nccl4py[cu12]"` to ensure correct CUDA 12 support.
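The versions mentioned above can be read directly from the running PyTorch build. A guarded sketch (it degrades gracefully on CPU-only builds or machines without PyTorch):

```python
# Hypothetical compatibility check: report the CUDA and NCCL versions the
# installed PyTorch was compiled against.
try:
    import torch
    cuda_ver = torch.version.cuda             # e.g. "12.1"; None on CPU-only builds
    try:
        nccl_ver = torch.cuda.nccl.version()  # version tuple, e.g. (2, 29, 7)
    except Exception:
        nccl_ver = None                       # CPU-only build or missing binding
    print("torch CUDA:", cuda_ver, "| torch NCCL:", nccl_ver)
except ImportError:
    cuda_ver = nccl_ver = None
    print("PyTorch is not installed in this environment")
```

Comparing this output against the `nvidia-nccl-cu12` version you installed is the quickest way to spot a mismatch.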
gotcha The `nvidia-nccl-cu12` package itself primarily provides the `libnccl.so` shared library. Direct Python API calls are not exposed through this package. Instead, Python users interact with NCCL through higher-level libraries like `nccl4py` (official bindings) or as a backend to distributed training modules in frameworks like PyTorch (`torch.distributed`) or TensorFlow (`tf.distribute`).
fix To use NCCL directly from Python, install and import `nccl4py`. If using with a deep learning framework, configure its distributed module to use the NCCL backend. Avoid `import nccl` for direct API calls, as this package is a runtime provider.
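To see that the pip package really is just a shared-library carrier, you can locate the bundled `libnccl.so` and call `ncclGetVersion()` (a real NCCL C API) through ctypes. A guarded sketch, assuming the pip package's standard `nvidia/nccl/lib` layout:

```python
# Hypothetical probe: find the libnccl.so shipped by nvidia-nccl-cu12 and ask
# it for its version code. Everything is guarded so the script also runs on
# machines where the package (or its CUDA dependencies) is absent.
import ctypes
import glob
import os
import sysconfig

site = sysconfig.get_paths()["purelib"]
found = glob.glob(os.path.join(site, "nvidia", "nccl", "lib", "libnccl.so*"))

version_code = None
if found:
    try:
        lib = ctypes.CDLL(found[0])
        v = ctypes.c_int()
        lib.ncclGetVersion(ctypes.byref(v))  # 22907 would mean 2.29.7
        version_code = v.value
    except OSError:
        pass  # library present but not loadable here

print("bundled libnccl:", found or "none", "| version code:", version_code)
```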
gotcha Conflicts can arise if multiple NCCL installations are present on the system (e.g., `nvidia-nccl-cu12` from PyPI, a system-wide `apt`/`dnf` installed NCCL, or one bundled with a deep learning framework). The linker's search path (`LD_LIBRARY_PATH`) can affect which `libnccl.so` is loaded, potentially leading to incorrect versions being used.
fix Prefer using `nvidia-nccl-cu12` installed via pip for consistency within Python environments. If system-wide NCCL is necessary, carefully manage `LD_LIBRARY_PATH` to ensure the correct `libnccl.so` is prioritized. Frameworks like PyTorch often statically link NCCL, mitigating some of these issues, but custom builds might need `USE_SYSTEM_NCCL` flags.
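When hunting for competing copies, it helps to list what the loader could pick up. A stdlib-only sketch (note that `ctypes.util.find_library` consults ldconfig's cache, so LD_LIBRARY_PATH entries are scanned separately):

```python
# Hypothetical diagnostic: enumerate libnccl.so candidates visible to this process.
import ctypes.util
import glob
import os

# Directories on LD_LIBRARY_PATH are searched first at load time.
for d in filter(None, os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)):
    hits = glob.glob(os.path.join(d, "libnccl.so*"))
    if hits:
        print("on LD_LIBRARY_PATH:", hits)

# System-wide copies known to ldconfig's cache.
resolved = ctypes.util.find_library("nccl")
print("ldconfig resolves libnccl to:", resolved or "nothing found")
```

Seeing more than one hit here is the usual sign of the version skew this gotcha describes.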
breaking The `nccl4py[cu12]` package, while recommended for direct Python interaction with NCCL CUDA 12, may not always have pre-built wheels available for all Python versions, operating systems, or architectures on PyPI. This can lead to `ERROR: Could not find a version that satisfies the requirement` during installation.
fix Verify the availability of `nccl4py[cu12]` for your specific Python version and OS on PyPI or the official `nccl4py` documentation. If pre-built wheels are not available, you might need to compile `nccl4py` from source (which requires a CUDA Toolkit installation and potentially other build dependencies) or consider using a deep learning framework's distributed module, which often bundles NCCL or manages its own bindings.
breaking The bare `nvidia-nccl` project name on the default PyPI.org index is a placeholder: installing it fails with a build error identifying it as a "placeholder project" and directing you to the NVIDIA Python Package Index. The CUDA 12 runtime is published under the suffixed name `nvidia-nccl-cu12`, which installs directly from PyPI.org.
fix Install the suffixed package: `pip install nvidia-nccl-cu12`. If you need NGC-hosted packages under the legacy names, configure the NVIDIA index first, either via `pip install nvidia-pyindex` or by passing it explicitly: `pip install --extra-index-url https://pypi.ngc.nvidia.com nvidia-nccl-cu12`.
pip install "nccl4py[cu12]" # Official Python bindings
python  os / libc      variant           status       install  import  disk
3.9     alpine (musl)  cu12              no_wheel     -        -       -
3.9     alpine (musl)  nvidia-nccl-cu12  build_error  -        -       -
3.9     slim (glibc)   cu12              no_wheel     -        -       -
3.9     slim (glibc)   nvidia-nccl-cu12  wheel        8.1s     -       412M
3.10    alpine (musl)  cu12              no_wheel     -        -       -
3.10    alpine (musl)  nvidia-nccl-cu12  build_error  -        -       -
3.10    slim (glibc)   cu12              -            -        -       -
3.10    slim (glibc)   nvidia-nccl-cu12  wheel        7.8s     -       412M
3.11    alpine (musl)  cu12              no_wheel     -        -       -
3.11    alpine (musl)  nvidia-nccl-cu12  build_error  -        -       -
3.11    slim (glibc)   cu12              -            -        -       -
3.11    slim (glibc)   nvidia-nccl-cu12  wheel        7.6s     -       414M
3.12    alpine (musl)  cu12              no_wheel     -        -       -
3.12    alpine (musl)  nvidia-nccl-cu12  build_error  -        -       -
3.12    slim (glibc)   cu12              -            -        -       -
3.12    slim (glibc)   nvidia-nccl-cu12  wheel        7.4s     -       406M
3.13    alpine (musl)  cu12              no_wheel     -        -       -
3.13    alpine (musl)  nvidia-nccl-cu12  build_error  -        -       -
3.13    slim (glibc)   cu12              -            -        -       -
3.13    slim (glibc)   nvidia-nccl-cu12  wheel        7.1s     -       406M

This quickstart demonstrates how NCCL is typically used indirectly via PyTorch's `torch.distributed` module for multi-GPU collective communication, specifically an `all_reduce` operation; NCCL provides the underlying high-performance backend. A distributed launcher (e.g., `torchrun` or `mpirun`) is required to run this code across multiple processes/GPUs. For direct Python bindings, consider `nccl4py` for explicit NCCL API calls.

import os
import torch
import torch.distributed as dist

# This quickstart assumes a multi-process setup, typically launched
# via torchrun or mpirun, where each process runs this script with a
# unique rank and world_size.

# Provide single-process defaults so the script also runs without a launcher;
# a real launcher (e.g., torchrun) overrides these via the environment.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

def run_distributed_example(rank, world_size):
    # Initialize the process group with NCCL backend
    print(f"Initializing process group (rank {rank} of {world_size})...")
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    print(f"Process group initialized on rank {rank}.")

    # Bind this process to its GPU (single-node: rank doubles as the device index;
    # on multi-node jobs, use the local rank instead)
    torch.cuda.set_device(rank)

    # Create a tensor on the GPU
    tensor = torch.ones(10, device=f'cuda:{rank}') * (rank + 1)
    print(f"Rank {rank}: Initial tensor value: {tensor}")

    # Perform an all_reduce operation (summing tensors across all GPUs)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    print(f"Rank {rank}: Tensor after all_reduce: {tensor}")

    # Clean up the process group
    dist.destroy_process_group()
    print(f"Rank {rank}: Process group destroyed.")

# To run this across 2 GPUs on one node:
#   torchrun --nproc_per_node=2 your_script.py
# (torch.distributed.launch is deprecated in favor of torchrun.)
# Or set the environment manually and start each process yourself:
#   MASTER_ADDR=localhost MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 python your_script.py
#   MASTER_ADDR=localhost MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 python your_script.py

# For simplicity, if running as a single process for structural check:
if __name__ == '__main__':
    # In a real scenario, rank and world_size would be provided by a launcher.
    # This block is for structural demonstration only and will not perform
    # actual distributed communication without a proper launcher.
    try:
        rank = int(os.environ.get('RANK', '0'))
        world_size = int(os.environ.get('WORLD_SIZE', '1'))
        if torch.cuda.is_available():
            run_distributed_example(rank, world_size)
        else:
            print("CUDA is not available; the NCCL backend requires a GPU.")
    except RuntimeError as e:
        print(f"Error initializing distributed environment: {e}. This often happens when the script is not started by a distributed launcher such as torchrun.")