nvshmem4py-cu12

0.3.0 · verified Mon Apr 27

Python bindings for NVSHMEM (NVIDIA's implementation of OpenSHMEM for GPUs). Version 0.3.0 requires Python >=3.9 and CUDA 12.x. This package enables peer-to-peer GPU communication across NVLink and InfiniBand. Under active development with frequent breaking changes.

pip install nvshmem4py-cu12
error ImportError: No module named 'nvshmem'
cause The nvshmem4py-cu12 package is not installed, so the 'nvshmem' module cannot be found (or the code still imports the old name 'nvshmem4py').
fix
Install nvshmem4py-cu12: pip install nvshmem4py-cu12. Then use 'from nvshmem import init'.
error ImportError: No module named 'cupy'
cause nvshmem4py-cu12 depends on CuPy built for CUDA 12.x, and cupy-cuda12x is not installed.
fix
Install cupy-cuda12x: pip install cupy-cuda12x
error RuntimeError: NVSHMEM internal error: invalid symmetric memory region
cause Using non-symmetric memory (e.g., numpy arrays) in NVSHMEM operations.
fix
Allocate memory via CuPy (cp.empty) or directly via nvshmem.shmalloc.
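The mem ops shown later in this entry take byte counts rather than element counts, so it helps to compute sizes explicitly. A minimal sketch (the `nbytes` helper is hypothetical, introduced only for illustration; it uses the standard library's `struct` to look up element sizes):

```python
import struct

# Hypothetical helper: convert an element count plus a struct format code
# into the byte length that NVSHMEM-style mem ops expect.
def nbytes(n_elems: int, fmt: str = "f") -> int:
    return n_elems * struct.calcsize(fmt)  # "f" = 4-byte float32

print(nbytes(1024))  # byte length of a 1024-element float32 buffer
```

Passing an element count where a byte count is expected silently transfers a quarter of the intended float32 data, so centralizing this arithmetic avoids a common class of bugs.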
error RuntimeError: NVSHMEM not initialized. Call nvshmem.init() first.
cause Calling NVSHMEM functions before init().
fix
Call nvshmem.init() before any other NVSHMEM function.
breaking NVSHMEM must be initialized after MPI_Init or equivalent. Calling init() before MPI will cause undefined behavior.
fix Ensure MPI_Init is called before nvshmem.init() in the same process.
breaking The library name changed from nvshmem4py to nvshmem (or nvshmem4py-cu12 for the CUDA 12 variant). Importing 'nvshmem4py' directly will fail.
fix Use 'from nvshmem import ...' instead of 'from nvshmem4py import ...'.
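For code that must run against both the old and new package layouts during migration, a guarded import can resolve whichever module name is available. A sketch, assuming only the two names mentioned above; `load_nvshmem` is a hypothetical helper, not part of the package:

```python
import importlib

# Hypothetical migration shim: try the new module name first, then the
# legacy one; return None if neither variant is installed.
def load_nvshmem():
    for name in ("nvshmem", "nvshmem4py"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

mod = load_nvshmem()  # a module object, or None when nothing is installed
```

Once the migration is complete, drop the shim and import from 'nvshmem' directly.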
deprecated The function nvshmem.sync_all() is deprecated. Use barrier() for synchronizing all PEs.
fix Replace sync_all() calls with barrier().
gotcha Symmetric memory must be allocated via CuPy or another supported allocator that respects NVSHMEM's memory pool. Memory from raw cudaMalloc is not symmetric and may lead to errors.
fix Use cp.empty() or cp.zeros() to allocate device arrays for NVSHMEM operations.
gotcha NVSHMEM operations require all processes to participate. Missing a barrier between collective operations can cause deadlocks.
fix Always synchronize with barrier() after each phase of communication.
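The failure mode is easy to reproduce without GPUs. The sketch below uses Python's `threading.Barrier` as an analogy (not NVSHMEM itself): every participant must reach the barrier before any may proceed, which mirrors the collective-participation requirement above.

```python
import threading

# Analogy only: threading.Barrier stands in for NVSHMEM's barrier().
# If one worker skipped barrier.wait(), the other would block forever,
# just like a PE missing a collective NVSHMEM barrier.
barrier = threading.Barrier(parties=2)
results = {}

def worker(i):
    results[i] = "phase 1"
    barrier.wait()            # both workers must arrive here
    results[i] = "phase 2"    # safe: phase 1 finished everywhere

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both workers reached phase 2
```

The same discipline applies to the put/get example below: a barrier after each communication phase guarantees that no PE reads data another PE has not finished writing.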

Initialize NVSHMEM, allocate symmetric GPU memory, perform put/get, and barrier.

import cupy as cp
from nvshmem import init, barrier, my_pe_n, n_pes, putmem, getmem

# Initialize NVSHMEM (must be called after MPI_Init or similar)
init()

rank = my_pe_n()
nranks = n_pes()

# Allocate symmetric memory on GPU (1024 float32 elements = 4096 bytes)
buf = cp.empty(1024, dtype=cp.float32)

# Barrier to synchronize all PEs before communicating
barrier()

print(f"Rank {rank}/{nranks} ready.", flush=True)

# Example: send data from rank 0 to rank 1 (if nranks > 1)
if nranks > 1:
    if rank == 0:
        buf[:] = 1.0
        putmem(buf.data.ptr, 1, 0, 1024 * 4)  # put local buf to PE 1
    barrier()  # ensure rank 0's write and put have completed everywhere
    if rank == 1:
        getmem(buf.data.ptr, 0, 0, 1024 * 4)  # get buf from PE 0
    barrier()

print(f"Rank {rank} finished.", flush=True)