nvshmem4py-cu12

0.3.0 · verified Mon Apr 27

Python bindings for NVSHMEM (NVIDIA's implementation of OpenSHMEM for GPUs). Version 0.3.0 requires Python >=3.9 and CUDA 12.x. This package enables peer-to-peer GPU communication across NVLink and InfiniBand. Under active development with frequent breaking changes.

pip install nvshmem4py-cu12
error ImportError: No module named 'nvshmem'
cause The nvshmem4py-cu12 package is not installed, so the 'nvshmem' module cannot be found (or the code still imports the old name 'nvshmem4py').
fix
Install nvshmem4py-cu12: pip install nvshmem4py-cu12. Then use 'from nvshmem import init'.
error ImportError: No module named 'cupy'
cause nvshmem4py-cu12 depends on CuPy built for CUDA 12.x, and cupy-cuda12x is not installed.
fix
Install cupy-cuda12x: pip install cupy-cuda12x
error RuntimeError: NVSHMEM internal error: invalid symmetric memory region
cause Using non-symmetric memory (e.g., numpy arrays) in NVSHMEM operations.
fix
Allocate memory via CuPy (cp.empty) or directly via nvshmem.shmalloc.
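The mem ops shown later in this entry take byte counts rather than element counts, so it helps to compute sizes explicitly. A minimal sketch (the `nbytes` helper is hypothetical, introduced only for illustration; it uses the standard library's `struct` to look up element sizes):

```python
import struct

# Hypothetical helper: convert an element count plus a struct format code
# into the byte length that NVSHMEM-style mem ops expect.
def nbytes(n_elems: int, fmt: str = "f") -> int:
    return n_elems * struct.calcsize(fmt)  # "f" = 4-byte float32

print(nbytes(1024))  # byte length of a 1024-element float32 buffer
```

Passing an element count where a byte count is expected silently transfers a quarter of the intended float32 data, so centralizing this arithmetic avoids a common class of bugs.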
error RuntimeError: NVSHMEM not initialized. Call nvshmem.init() first.
cause Calling NVSHMEM functions before init().
fix
Call nvshmem.init() before any other NVSHMEM function.
breaking NVSHMEM must be initialized after MPI_Init or equivalent. Calling init() before MPI will cause undefined behavior.
fix Ensure MPI_Init is called before nvshmem.init() in the same process.
breaking The library name changed from nvshmem4py to nvshmem (or nvshmem4py-cu12 for the CUDA 12 variant). Importing 'nvshmem4py' directly will fail.
fix Use 'from nvshmem import ...' instead of 'from nvshmem4py import ...'.
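For code that must run against both the old and new package layouts during migration, a guarded import can resolve whichever module name is available. A sketch, assuming only the two names mentioned above; `load_nvshmem` is a hypothetical helper, not part of the package:

```python
import importlib

# Hypothetical migration shim: try the new module name first, then the
# legacy one; return None if neither variant is installed.
def load_nvshmem():
    for name in ("nvshmem", "nvshmem4py"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

mod = load_nvshmem()  # a module object, or None when nothing is installed
```

Once the migration is complete, drop the shim and import from 'nvshmem' directly.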
deprecated The function nvshmem.sync_all() is deprecated. Use barrier() for synchronizing all PEs.
fix Replace sync_all() calls with barrier().
gotcha Symmetric memory must be allocated via CuPy or another supported allocator that respects NVSHMEM's memory pool. Memory from raw cudaMalloc is not symmetric and may lead to errors.
fix Use cp.empty() or cp.zeros() to allocate device arrays for NVSHMEM operations.
gotcha NVSHMEM operations require all processes to participate. Missing a barrier between collective operations can cause deadlocks.
fix Always synchronize with barrier() after each phase of communication.
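The failure mode is easy to reproduce without GPUs. The sketch below uses Python's `threading.Barrier` as an analogy (not NVSHMEM itself): every participant must reach the barrier before any may proceed, which mirrors the collective-participation requirement above.

```python
import threading

# Analogy only: threading.Barrier stands in for NVSHMEM's barrier().
# If one worker skipped barrier.wait(), the other would block forever,
# just like a PE missing a collective NVSHMEM barrier.
barrier = threading.Barrier(parties=2)
results = {}

def worker(i):
    results[i] = "phase 1"
    barrier.wait()            # both workers must arrive here
    results[i] = "phase 2"    # safe: phase 1 finished everywhere

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both workers reached phase 2
```

The same discipline applies to the put/get example below: a barrier after each communication phase guarantees that no PE reads data another PE has not finished writing.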

Initialize NVSHMEM, allocate symmetric GPU memory, perform put/get, and barrier.

import cupy as cp
from nvshmem import init, barrier, my_pe_n, n_pes, putmem, getmem

# Initialize NVSHMEM (must be called after MPI_Init or similar)
init()

rank = my_pe_n()
nranks = n_pes()

# Allocate symmetric memory on GPU (1024 float32 elements = 4096 bytes)
buf = cp.empty(1024, dtype=cp.float32)

# Barrier to synchronize all PEs before communicating
barrier()

print(f"Rank {rank}/{nranks} ready.", flush=True)

# Example: send data from rank 0 to rank 1 (if nranks > 1)
if nranks > 1:
    if rank == 0:
        buf[:] = 1.0
        putmem(buf.data.ptr, 1, 0, 1024 * 4)  # put local buf to PE 1
    barrier()  # ensure rank 0's write and put have completed everywhere
    if rank == 1:
        getmem(buf.data.ptr, 0, 0, 1024 * 4)  # get buf from PE 0
    barrier()

print(f"Rank {rank} finished.", flush=True)