{"id":3186,"library":"nvshmem4py-cu13","title":"Python bindings for NVSHMEM","description":"NVSHMEM4Py is the official Python language binding for NVSHMEM, a high-performance parallel programming interface based on OpenSHMEM. It provides a Pythonic interface to NVSHMEM's functionality, enabling applications to leverage the Partitioned Global Address Space (PGAS) programming model for efficient multi-GPU and multi-node communication. Key features include seamless integration with NumPy, CuPy, and PyTorch, symmetric memory management, and support for one-sided communication operations (put/get, collectives, atomics) and synchronization primitives. The library `nvshmem4py-cu13` specifically targets CUDA 13.x. The latest release, version 0.3.0, was published in March 2026.","status":"active","version":"0.3.0","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/nvshmem","tags":["HPC","GPU","CUDA","NVSHMEM","distributed computing","PGAS","Python bindings","scientific computing"],"install":[{"cmd":"pip install nvshmem4py-cu13 nvidia-nvshmem-cu13","lang":"bash","label":"Recommended Installation"}],"dependencies":[{"reason":"Provides the underlying NVSHMEM C/C++ library for CUDA 13.x.","package":"nvidia-nvshmem-cu13","optional":false},{"reason":"Interoperability with NumPy arrays for symmetric memory management.","package":"numpy","optional":false},{"reason":"Used for certain initialization methods (e.g., MPI Comm-based) and testing frameworks.","package":"mpi4py","optional":true},{"reason":"Seamless interoperability and NVSHMEM operations with CuPy arrays and tensors.","package":"cupy","optional":true},{"reason":"Seamless interoperability and NVSHMEM operations with PyTorch tensors.","package":"torch","optional":true},{"reason":"Enables writing fused compute-communication GPU kernels in Python using Numba's CUDA DSL.","package":"numba-cuda","optional":true},{"reason":"Enables interoperability with Triton-expressed GPU kernels.","package":"triton","optional":true}],"imports":[{"symbol":"nvshmem","correct":"import nvshmem.core as nvshmem"}],"quickstart":{"code":"import nvshmem.core as nvshmem\n\ndef main():\n    # Initialize the NVSHMEM environment\n    if not nvshmem.is_initialized():\n        nvshmem.init()\n\n    # Query the current Processing Element (PE) ID and the total number of PEs\n    my_pe = nvshmem.my_pe()\n    n_pes = nvshmem.n_pes()\n\n    print(f\"Hello from PE {my_pe} of {n_pes}!\")\n\n    # Finalize the NVSHMEM environment\n    nvshmem.finalize()\n\nif __name__ == \"__main__\":\n    # This example must be launched with an MPI runner, e.g.:\n    # mpirun -np 2 python your_script_name.py\n    main()\n","lang":"python","description":"This quickstart demonstrates basic initialization, querying the Processing Element (PE) ID and the total number of PEs, and finalization of the NVSHMEM environment. NVSHMEM is a multi-process library, so applications typically need to be launched with an MPI runner (e.g., `mpirun`)."},"warnings":[{"fix":"Ensure the CUDA toolkit's include directory is in your compiler's search path before running pip (e.g., `export CPPFLAGS=\"-I/usr/local/cuda/include\"` or `export CPATH=\"/usr/local/cuda/include:$CPATH\"`).","message":"Installation can fail when the CUDA runtime API headers are missing. NVSHMEM4Py's internal bindings require these headers to be available in the compiler's include path during installation (e.g., via `CPPFLAGS` or `CPATH`).","severity":"gotcha","affected_versions":"All versions"},{"fix":"Explicitly call `nvshmem.core.free_tensor(tensor_obj)` for all allocated symmetric tensors or buffers so they are deallocated before `nvshmem.core.finalize()` is called. Consider upgrading to the latest version to benefit from potential fixes.","message":"`nvshmem.core.finalize()` may hang or error if symmetric memory buffers hold multiple references and are not explicitly freed before finalization. The internal buffer tracking may not fully deallocate resources whose reference counts are above one, causing issues when Python's garbage collector attempts to free them after NVSHMEM has been finalized.","severity":"gotcha","affected_versions":"Versions <= 0.2.1 (issue discussed and acknowledged; fix expected in later releases)."},{"fix":"Refer to the latest official documentation for `nvshmem.core.rma.quiet`, and explicitly use CUDA stream synchronization primitives (e.g., `cudaStreamSynchronize`) or NVSHMEM's collective synchronization APIs when remote completion guarantees are needed.","message":"Misinterpretation of `nvshmem.core.rma.quiet` semantics: older documentation incorrectly implied that `nvshmem.core.rma.quiet` guaranteed remote completion of RMA operations, similar to `shmem_quiet` in OpenSHMEM. In reality it only ensures local completion; remote completion must be enforced through stream synchronization.","severity":"gotcha","affected_versions":"Documentation prior to fix in the 0.3.0 release."},{"fix":"Ensure all buffers used in NVSHMEM RMA or atomic operations are allocated with `nvshmem.malloc` or `nvshmem.calloc`, or explicitly registered with `nvshmemx_buffer_register_symmetric`.","message":"InfiniBand (IB) failures (Remote Protection Error / Local Protection Error) occur when using non-symmetric heap addresses. RMA or atomic operations on memory not allocated from NVSHMEM's symmetric heap, and not registered as local buffers, will result in protection errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Avoid blocking CUDA calls on the default stream (stream 0) within performance-critical loops. Use non-blocking operations and explicit stream synchronization (e.g., `cudaStreamCreate` and `cudaStreamSynchronize` on specific streams) to manage dependencies.","message":"Blocking CUDA calls on stream 0 (e.g., `cudaDeviceSynchronize`, `cudaMemcpy`) in the iterative phase of an application can lead to hangs, particularly in NVSHMEM programs.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}