Python bindings for NVSHMEM
NVSHMEM4Py is the official Python language binding for NVSHMEM, a high-performance parallel programming library based on OpenSHMEM. It exposes NVSHMEM's Partitioned Global Address Space (PGAS) programming model through a Pythonic interface, enabling efficient multi-GPU and multi-node communication. Key features include integration with NumPy, CuPy, and PyTorch, symmetric memory management, one-sided communication operations (put/get, collectives, atomics), and synchronization primitives. The `nvshmem4py-cu13` package specifically targets CUDA 13.x. The project releases regularly; the latest version, 0.3.0, shipped in March 2026.
Warnings
- gotcha Installation issues due to missing CUDA runtime API headers. NVSHMEM4Py's internal bindings require the CUDA runtime API headers to be available in the compiler's include path during installation, e.g., via `CPPFLAGS` or `CPATH` (`export CPATH=/path/to/cuda/include:$CPATH`, substituting your toolkit's include directory).
- gotcha Potential hangs or errors from `nvshmem.core.finalize()` if symmetric memory buffers have multiple references and are not explicitly freed before finalization. The internal buffer tracking might not fully deallocate all resources if reference counts are above one, leading to issues if Python's garbage collector attempts to free them after NVSHMEM is finalized.
- gotcha Misinterpretation of `nvshmem.core.rma.quiet` semantics: older documentation incorrectly implied that `nvshmem.core.rma.quiet` guaranteed remote completion of RMA operations, similar to `shmem_quiet` in OpenSHMEM. In reality, it only ensures local completion; remote completion requires synchronizing the CUDA stream on which the operations were issued.
- gotcha InfiniBand (IB) failures (Remote Protection Error / Local Protection Error) when using non-symmetric heap addresses. Attempting RMA or atomic operations on memory addresses not allocated via NVSHMEM's symmetric heap or not registered as local buffers will result in protection errors.
- gotcha Blocking CUDA calls on the default stream (stream 0), such as `cudaDeviceSynchronize` or `cudaMemcpy`, in the iterative phase of an NVSHMEM program can cause hangs, because they can stall the progress NVSHMEM communication depends on.
Install
- pip install nvshmem4py-cu13 nvidia-nvshmem-cu13
Imports
- nvshmem
import nvshmem.core as nvshmem
Quickstart
import nvshmem.core as nvshmem

def main():
    # Initialize the NVSHMEM environment (skip if already initialized)
    if not nvshmem.is_initialized():
        nvshmem.init()

    # Query this Processing Element (PE) ID and the total number of PEs
    my_pe = nvshmem.my_pe()
    n_pes = nvshmem.n_pes()
    print(f"Hello from PE {my_pe} of {n_pes}!")

    # Finalize the NVSHMEM environment
    nvshmem.finalize()

if __name__ == "__main__":
    # This example must be launched with an MPI runner, e.g.:
    #   mpirun -np 2 python your_script_name.py
    main()
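The quiet-semantics gotcha in the Warnings section can be illustrated with a completion-ordering sketch. The RMA call names and argument shapes below (`rma.put`, `rma.quiet`, a stream object with a `sync()` method) are assumptions about the nvshmem4py API, not verified signatures; treat this as a pattern, not a drop-in implementation.

```python
def put_with_remote_completion(dst_buf, src_buf, peer, stream):
    """Sketch: issue a one-sided put and wait for *remote* completion.

    Assumed API names (verify against your nvshmem4py version):
    nvshmem.core.rma.put / nvshmem.core.rma.quiet, and a CUDA stream
    exposing sync().
    """
    import nvshmem.core as nvshmem

    # Issue a one-sided put on the given CUDA stream.
    nvshmem.rma.put(dst_buf, src_buf, peer, stream=stream)

    # quiet() only guarantees LOCAL completion: the source buffer may be
    # reused, but the data is not yet guaranteed visible at the target.
    nvshmem.rma.quiet(stream=stream)

    # Remote completion is guaranteed only after the stream is synchronized.
    stream.sync()
```

The key takeaway mirrors the warning: after `quiet`, reuse of the local source buffer is safe, but any logic that depends on the data having arrived at the remote PE must wait for the stream synchronization.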