{"id":1595,"library":"nvidia-nccl-cu13","title":"NVIDIA Collective Communication Library (NCCL) Runtime","description":"The `nvidia-nccl-cu13` package provides the NVIDIA Collective Communication Library (NCCL) runtime specific to CUDA 13.x. NCCL is a library of standard routines for inter-GPU communication, optimized for NVIDIA GPUs. It is primarily used as a backend by deep learning frameworks like PyTorch and TensorFlow for distributed training on multi-GPU systems. This package does not expose a direct Python API for end-users but provides the necessary shared libraries. It's released in conjunction with NVIDIA CUDA Toolkit versions.","status":"active","version":"2.29.7","language":"en","source_language":"en","source_url":"https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html","tags":["nvidia","cuda","deep-learning","distributed-computing","nccl","runtime","gpu","pytorch","tensorflow"],"install":[{"cmd":"pip install nvidia-nccl-cu13","lang":"bash","label":"Install NCCL runtime for CUDA 13"}],"dependencies":[{"reason":"Provides the core CUDA runtime libraries for CUDA 13, which NCCL depends on.","package":"nvidia-cuda-runtime-cu13","optional":false},{"reason":"Commonly used with PyTorch for distributed training, which leverages NCCL internally.","package":"torch","optional":true},{"reason":"Commonly used with TensorFlow for distributed training, which can leverage NCCL internally.","package":"tensorflow","optional":true}],"imports":[{"note":"Users typically do not 'import nccl' directly. Instead, frameworks like `torch.distributed` or `tf.distribute` will utilize the NCCL libraries provided by this package internally for multi-GPU communication.","symbol":"NCCL Runtime (indirect usage)","correct":"This package primarily provides shared library files (e.g., libnccl.so) that deep learning frameworks (like PyTorch or TensorFlow) link against. 
It does NOT expose a direct Python API for end-user import."}],"quickstart":{"code":"import os\nimport torch\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\n\ndef setup(rank, world_size):\n    os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')\n    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')\n    dist.init_process_group(\"nccl\", rank=rank, world_size=world_size)\n\ndef cleanup():\n    dist.destroy_process_group()\n\nclass ToyModel(torch.nn.Module):\n    def __init__(self):\n        super(ToyModel, self).__init__()\n        self.net1 = torch.nn.Linear(10, 10)\n        self.relu = torch.nn.ReLU()\n        self.net2 = torch.nn.Linear(10, 5)\n\n    def forward(self, x):\n        return self.net2(self.relu(self.net1(x)))\n\ndef demo_basic(rank, world_size):\n    print(f\"Running basic DDP example on rank {rank}.\")\n    setup(rank, world_size)\n\n    # NCCL requires CUDA devices; on a single node the rank doubles as the GPU index\n    device = torch.device(f'cuda:{rank}')\n    torch.cuda.set_device(device)\n    model = ToyModel().to(device)\n    ddp_model = DDP(model, device_ids=[rank])\n\n    loss_fn = torch.nn.MSELoss()\n    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)\n\n    for _ in range(3):\n        inputs = torch.randn(20, 10).to(device)\n        labels = torch.randn(20, 5).to(device)\n        optimizer.zero_grad()\n        outputs = ddp_model(inputs)\n        loss = loss_fn(outputs, labels)\n        loss.backward()\n        optimizer.step()\n        if rank == 0: # Only print from rank 0 to avoid floods\n            print(f\"Rank {rank}, Loss: {loss.item():.4f}\")\n\n    cleanup()\n\nif __name__ == \"__main__\":\n    # torchrun sets RANK and WORLD_SIZE for every process it launches, e.g.:\n    #   torchrun --nproc_per_node=2 quickstart.py\n    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:\n        demo_basic(int(os.environ['RANK']), int(os.environ['WORLD_SIZE']))\n    else:\n        print(\"This script demonstrates NCCL usage via PyTorch DDP.\")\n        print(\"To run, execute with `torchrun --nproc_per_node=<num_gpus> your_script.py`\")\n        print(\"e.g., `torchrun --nproc_per_node=2 quickstart.py`\")","lang":"python","description":"This quickstart demonstrates how NCCL is implicitly used by PyTorch for distributed data parallel (DDP) training across multiple GPUs. The `nvidia-nccl-cu13` package provides the underlying `libnccl.so` library that `torch.distributed` uses when `dist.init_process_group` is called with the 'nccl' backend. The code sets up a minimal DDP training loop. Run the script with `torchrun` (part of PyTorch), which launches one process per GPU and sets the `RANK` and `WORLD_SIZE` environment variables that the script reads."},"warnings":[{"fix":"Do not attempt to import this package directly for API access. Instead, ensure it is installed when using frameworks like PyTorch or TensorFlow for multi-GPU training, as they will use it automatically.","message":"This package is a runtime dependency and does NOT expose a direct Python API. You typically won't `import nvidia_nccl` or `import nccl` in your Python code. Its functionality is leveraged internally by higher-level deep learning frameworks.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that your system's CUDA toolkit version, the `nvidia-cuda-runtime-cu<version>` package, and the `nvidia-nccl-cu<version>` package all match. If you are using PyTorch or TensorFlow, check which CUDA version they were compiled with and install the corresponding `nvidia-nccl-cu<version>` package.","message":"CUDA Version Mismatch: The `nvidia-nccl-cu13` package is specifically compiled for CUDA 13.x. 
Using it with a different CUDA version (e.g., CUDA 12.x or 11.x) installed on your system or expected by your deep learning framework can lead to runtime errors (e.g., `_nccl_create_comm` failed, symbol lookup errors).","severity":"breaking","affected_versions":"All versions, specifically when interacting with system CUDA or framework builds."},{"fix":"Prioritize matching the `nvidia-nccl-cu13` package version with the CUDA version targeted by your deep learning framework. If issues arise, check the framework's documentation regarding its NCCL dependency and consider using a specific environment (e.g., Conda) to isolate dependencies.","message":"Conflicts with Framework-Bundled NCCL: Some deep learning frameworks (e.g., PyTorch, TensorFlow) might ship with their own pre-compiled NCCL libraries, or they might expect a specific version of NCCL installed globally. This can lead to conflicts if the `nvidia-nccl-cu13` package's version doesn't align with the framework's expectation.","severity":"gotcha","affected_versions":"All versions, especially when managing multiple environments or different framework builds."}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}