nvshmem4py-cu12
Python bindings for NVSHMEM (NVIDIA's implementation of OpenSHMEM for GPUs). Version 0.3.0 requires Python >=3.9 and CUDA 12.x. This package enables peer-to-peer GPU communication across NVLink and InfiniBand. Under active development with frequent breaking changes.
pip install nvshmem4py-cu12
Common errors
error ImportError: No module named 'nvshmem' ↓
cause The package is not installed, or the code imports under the old name 'nvshmem4py'.
fix
Install nvshmem4py-cu12: pip install nvshmem4py-cu12. Then use 'from nvshmem import init'.
error ImportError: No module named 'cupy' ↓
cause CuPy for CUDA 12 is not installed; nvshmem4py-cu12 requires CuPy built for CUDA 12.x.
fix
Install cupy-cuda12x: pip install cupy-cuda12x
error RuntimeError: NVSHMEM internal error: invalid symmetric memory region ↓
cause Using non-symmetric memory (e.g., numpy arrays) in NVSHMEM operations.
fix
Allocate memory via CuPy (cp.empty) or directly via nvshmem.shmalloc.
error RuntimeError: NVSHMEM not initialized. Call nvshmem.init() first. ↓
cause Calling NVSHMEM functions before init().
fix
Call nvshmem.init() before any other NVSHMEM function.
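One defensive pattern for larger applications is to put NVSHMEM calls behind a guard that fails fast with the same clear message when init() has not run. The sketch below is pure Python and hypothetical; `NvshmemGuard` and its methods stand in for calls into the real bindings:

```python
class NvshmemGuard:
    """Raise a clear error if NVSHMEM calls happen before init()."""

    def __init__(self):
        self._initialized = False

    def init(self):
        # In real code this would call nvshmem.init()
        self._initialized = True

    def require_init(self):
        # Call this at the top of any wrapper around an NVSHMEM operation
        if not self._initialized:
            raise RuntimeError("NVSHMEM not initialized. Call nvshmem.init() first.")

guard = NvshmemGuard()
try:
    guard.require_init()  # raises: init() has not been called yet
except RuntimeError as e:
    print(e)
guard.init()
guard.require_init()  # passes silently after init()
```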
Warnings
breaking NVSHMEM must be initialized after MPI_Init or equivalent. Calling init() before MPI will cause undefined behavior. ↓
fix Ensure MPI_Init is called before nvshmem.init() in the same process.
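In practice every rank must bring MPI up before NVSHMEM. A typical launch looks like the sketch below; `mpirun` is one common launcher (not the only one) and `app.py` is a hypothetical script name:

```shell
# Launch 2 ranks; each process must call MPI_Init (for example via
# mpi4py's `from mpi4py import MPI`) before it calls nvshmem.init().
mpirun -np 2 python app.py
```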
breaking The library name changed from nvshmem4py to nvshmem (or nvshmem4py-cu12 for the CUDA 12 variant). Importing 'nvshmem4py' directly will fail. ↓
fix Use 'from nvshmem import ...' instead of 'from nvshmem4py import ...'.
deprecated The function nvshmem.sync_all() is deprecated. Use barrier() for synchronizing all PEs. ↓
fix Replace sync_all() calls with barrier().
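If legacy code still calls sync_all(), a migration shim can route it to barrier() while emitting a warning. This is a sketch; `barrier` here is a local stand-in, not the real binding:

```python
import warnings

def barrier():
    # Stand-in for nvshmem.barrier(); a no-op in this sketch.
    pass

def sync_all():
    """Deprecated shim: forwards to barrier() with a DeprecationWarning."""
    warnings.warn("sync_all() is deprecated; use barrier()",
                  DeprecationWarning, stacklevel=2)
    barrier()
```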
gotcha Symmetric memory must be allocated via CuPy or other supported allocator that respects NVSHMEM's memory pool. Using raw cudaMalloc may lead to errors. ↓
fix Use cp.empty() or cp.zeros() to allocate device arrays for NVSHMEM operations.
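When passing a CuPy buffer to a raw put/get, the byte count is element count × element size (CuPy exposes it directly as `arr.nbytes`). A GPU-free sketch of the arithmetic behind the quickstart's `1024 * 4`:

```python
import struct

count = 1024
elem_size = struct.calcsize("f")  # a C float (float32) is 4 bytes
nbytes = count * elem_size
print(nbytes)  # 4096 — the byte count passed to putmem/getmem below
```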
gotcha NVSHMEM operations require all processes to participate. Missing a barrier between collective operations can cause deadlocks. ↓
fix Always synchronize with barrier() after each phase of communication.
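The all-participants requirement behaves like a thread barrier: if any party skips the barrier, everyone else blocks forever. A CPU-only analogy with `threading.Barrier`, illustrating the pattern rather than NVSHMEM itself:

```python
import threading

NUM_PES = 4
barrier = threading.Barrier(NUM_PES)  # analogue of an NVSHMEM barrier
results = []

def worker(pe):
    results.append(f"pe {pe} phase 1")
    barrier.wait()  # every "PE" must arrive here, or the others hang
    results.append(f"pe {pe} phase 2")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_PES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier guarantees all phase-1 entries precede all phase-2 entries.
print(len(results))  # 8
```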
Imports
- init: from nvshmem import init
- barrier: from nvshmem import barrier
- my_pe_n: from nvshmem import my_pe_n
- n_pes: from nvshmem import n_pes
Quickstart
import cupy as cp
import nvshmem  # provides putmem/getmem used below
from nvshmem import init, barrier, my_pe_n, n_pes
# Initialize NVSHMEM (must be called after MPI_Init or similar)
init()
rank = my_pe_n()
nranks = n_pes()
# Allocate symmetric memory on GPU
buf = cp.empty(1024, dtype=cp.float32)
# Barrier to synchronize
barrier()
print(f"Rank {rank}/{nranks} ready.", flush=True)
# Example: send data from rank 0 to rank 1 (if nranks > 1)
if nranks > 1:
    if rank == 0:
        buf[:] = 1.0
        nvshmem.putmem(buf.data.ptr, 1, 0, 1024 * 4)  # put to rank 1
    elif rank == 1:
        nvshmem.getmem(buf.data.ptr, 0, 0, 1024 * 4)  # get from rank 0
barrier()
print(f"Rank {rank} finished.", flush=True)