NVIDIA NVSHMEM (nvshmem4py) - CUDA 12

NVIDIA NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs, providing a Partitioned Global Address Space (PGAS) for efficient and scalable communication in GPU clusters. The `nvidia-nvshmem-cu12` package provides the official Python bindings (NVSHMEM4Py) for CUDA 12.x compatible environments, enabling Python applications to leverage NVSHMEM's high-performance communication model. The current version is 3.6.5, with releases typically occurring several times a year to align with NVSHMEM and CUDA toolkit updates.

pip install nvidia-nvshmem-cu12
error ModuleNotFoundError: No module named 'nvshmem'
cause This error occurs when the 'nvshmem' Python module cannot be found, typically because the `nvshmem4py` package, which provides the Python bindings for NVSHMEM, is not installed or the environment is not correctly configured to find it.
fix Ensure the correct nvshmem4py package for your CUDA version and nvidia-nvshmem-cu12 are installed using pip: pip install nvshmem4py-cu12 nvidia-nvshmem-cu12 (for CUDA 12.x).
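A quick way to confirm whether the binding is actually importable in the current environment is a check like the following; a minimal sketch that only inspects the installed packages (no GPU required), using the same 'nvshmem' package name as the quickstart below:

import importlib.util

# Look for the 'nvshmem' package on sys.path without importing it fully.
spec = importlib.util.find_spec("nvshmem")
if spec is None:
    print("nvshmem not found: install nvshmem4py-cu12 and nvidia-nvshmem-cu12")
else:
    print(f"nvshmem package located at: {spec.submodule_search_locations}")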
error building 'nvshmem.bindings.nvshmem' extension creating build ... (followed by errors like 'cannot find CUDA runtime headers')
cause This error indicates a compilation failure during the installation of `nvshmem4py` (often when building from source or certain wheel installations) because the CUDA runtime API headers are not accessible in the compiler's include path.
fix Set the CUDA_HOME environment variable to point to your CUDA Toolkit installation and ensure its include directory is in your compiler's search path, for example, by setting export CPPFLAGS="-I$CUDA_HOME/include" or export CPATH="$CUDA_HOME/include:$CPATH".
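Before rerunning the install, it can help to verify that the headers are actually where CUDA_HOME points. A minimal sketch, assuming a conventional toolkit layout with cuda_runtime.h under $CUDA_HOME/include:

import os

# Check that the CUDA runtime header is visible under CUDA_HOME.
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # default path is an assumption
header = os.path.join(cuda_home, "include", "cuda_runtime.h")
if os.path.isfile(header):
    print(f"Found CUDA runtime headers: {header}")
else:
    print(f"Missing {header}; set CUDA_HOME or CPPFLAGS/CPATH before installing")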
error NVSHMEM bootstrap module could not be loaded at runtime. NVSHMEM uses dynamically loaded bootstrap modules for several bootstraps, including MPI, OpenSHMEM, and PMIx; this error indicates that the requested bootstrap module (for example, MPI) could not be found by the dynamic linker.
cause This runtime error (or similar 'undefined symbol' errors) occurs when the system's dynamic linker cannot find the necessary NVSHMEM native shared libraries (`.so` files) at runtime, often because the `LD_LIBRARY_PATH` environment variable does not include the directory where NVSHMEM was installed.
fix Set the LD_LIBRARY_PATH environment variable to include the path to the NVSHMEM library directory. For example, if NVSHMEM is installed in /usr/local, you might use export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH.
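One way to sanity-check the dynamic-linker setup from Python is to attempt loading the host library directly with ctypes. The library filename below is an assumption; check your installation for the exact soname:

import ctypes
import os

print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
try:
    # 'libnvshmem_host.so' is a hypothetical soname; adjust to match your install.
    ctypes.CDLL("libnvshmem_host.so")
    print("NVSHMEM host library loaded")
except OSError as exc:
    print(f"Dynamic linker could not load NVSHMEM: {exc}")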
breaking Internal layout changes in RC-connected Queue Pairs (QPs) starting in NVSHMEM 3.5.19 caused ABI compatibility breakage when enabling InfiniBand GPUDirect Async (IBGDA). This affects custom builds or specific configurations leveraging IBGDA.
fix Users enabling IBGDA should review NVSHMEM release notes for compatibility or rebuild applications against the specific NVSHMEM version they are using. Upgrading NVSHMEM may require recompiling dependent libraries.
gotcha NVSHMEM (including its device-side APIs) and libraries that use NVSHMEM can typically only be built and linked as static libraries, because CUDA device symbols cannot be linked across shared-library boundaries.
fix When building applications that use NVSHMEM, ensure that NVSHMEM components are linked statically. Avoid attempts to link `libnvshmem.so` as a shared library if encountered in lower-level C/C++ development.
gotcha With NVIDIA drivers older than 460.106.00 (or the corresponding fixes on the 470+ branches), NVSHMEM may be unable to allocate the complete device memory because BAR1 space is not reused correctly. This can lead to memory allocation failures or unexpected behavior.
fix Update your NVIDIA GPU driver to 460.106.00 or later on the 460 branch, or to a 470+ branch release, to resolve the BAR1 memory allocation issue.
gotcha NVSHMEM is not officially supported in virtualized environments (VMs). Using it in such environments may lead to unexpected behavior, performance degradation, or outright failures.
fix Run NVSHMEM applications on bare-metal systems for full support and expected behavior.
gotcha When `pip install nvidia-nvshmem-cu12` needs to compile Cython source code (e.g., if a pre-built wheel is not available), the CUDA runtime API headers must be accessible in the compiler's include path; otherwise the build fails with errors like 'Failed building wheel for nvshmem4py-cu12'.
fix Set environment variables like `CPATH` or `CPPFLAGS` to include your CUDA toolkit's include directory (e.g., `-I/usr/local/cuda/include`) before running `pip install`.
deprecated Support for the active set-based collectives interface in OpenSHMEM has been removed. Older applications relying on this interface will no longer function as expected.
fix Migrate applications to use team-based collectives (e.g., `NVSHMEM_TEAM_WORLD`) as per the current OpenSHMEM specification and NVSHMEM best practices.
gotcha Installing `nvidia-nvshmem-cu12` may fail with 'No matching distribution found' because pre-built wheels for the package are not available on PyPI for the specific Python version (e.g., 3.13) or operating system/architecture (e.g., Alpine Linux). NVIDIA NVSHMEM wheels are typically compiled against specific Python and CUDA versions and may not be immediately available for newly released Python versions or less common platforms.
fix Ensure you are using a Python version and operating system for which `nvidia-nvshmem-cu12` wheels are officially published on PyPI. Check the official NVIDIA NVSHMEM documentation or PyPI project page for supported configurations. If no pre-built wheels are available, you may need to compile the package from source, if supported, which might require additional development dependencies.
python  os / libc      status       install  import  disk  mem  side effects
3.9     alpine (musl)  build_error  -        -       -     -    -
3.9     slim (glibc)   wheel        5.7s     -       247M  -    broken
3.10    alpine (musl)  build_error  -        -       -     -    -
3.10    slim (glibc)   wheel        4.8s     -       248M  -    broken
3.11    alpine (musl)  build_error  -        -       -     -    -
3.11    slim (glibc)   wheel        4.4s     -       249M  -    broken
3.12    alpine (musl)  build_error  -        -       -     -    -
3.12    slim (glibc)   wheel        4.8s     -       241M  -    broken
3.13    alpine (musl)  build_error  -        -       -     -    -
3.13    slim (glibc)   wheel        4.2s     -       241M  -    broken

This quickstart demonstrates basic initialization and finalization of NVSHMEM in a Python program using `nvshmem4py`, and shows how to query the current PE (processing element) ID and the total number of PEs. NVSHMEM operations are collective: the script must be launched through a parallel launcher such as `mpiexec` (from MPI) or `nvshmrun` (provided with NVSHMEM) so that multiple PEs are allocated and coordinated across GPUs; running it directly with `python` will fail.

import nvshmem.core as nvshmem

def main():
    # Initialize NVSHMEM. This is a collective operation.
    # In a real scenario, this script would be launched with `mpiexec` or `nvshmrun`.
    nvshmem.init()

    # Query PE information
    my_pe = nvshmem.my_pe()
    n_pes = nvshmem.n_pes()

    print(f"Hello from PE {my_pe} of {n_pes}")

    # Perform some simple collective (e.g., a barrier)
    # This ensures all PEs reach this point before proceeding
    nvshmem.barrier_all()

    # Finalize NVSHMEM. This is also a collective operation.
    nvshmem.finalize()

if __name__ == '__main__':
    # Note: This script needs to be run using an MPI launcher (e.g., mpiexec -n 2 python your_script.py)
    # or NVSHMEM's own launcher (nvshmrun). Running directly 'python your_script.py'
    # will result in an error or hang if NVSHMEM expects multiple processes.
    try:
        main()
    except Exception as e:
        # Catch potential errors if not launched collectively, for a more graceful exit
        print(f"Error: {e}")
        print("Please ensure the script is launched collectively, e.g., 'mpiexec -n 2 python quickstart.py'")