cuMM: CUda Matrix Multiply Library

cuMM is a high-performance CUDA matrix multiplication library designed for deep learning and scientific computing. It provides optimized GEMM (General Matrix Multiply) kernels and supports various precision formats. Version 0.8.2 requires Python >=3.8 and is actively maintained.

pip install cumm-cu126
error ModuleNotFoundError: No module named 'cumm'
cause The package 'cumm-cu126' was installed, but Python cannot locate the module, typically because it was installed into a different environment than the one running the script, or because the import statement uses the wrong name. The module name is exactly 'cumm' (no hyphen).
fix Run 'pip install cumm-cu126' in the environment you are actually running, and import the module as 'import cumm' (no hyphen). Check that the CUDA 12.6 toolkit is available.
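
A minimal sketch for confirming the install and import name, assuming the package exposes a standard __version__ attribute (not verified for every release):

# Check that 'cumm' is importable from the current environment.
try:
    import cumm
    print("cumm found, version:", getattr(cumm, "__version__", "unknown"))
except ModuleNotFoundError:
    print("cumm is missing; run: pip install cumm-cu126")
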
error RuntimeError: CUDA error: no kernel image is available for execution on the device
cause The GPU architecture is not supported by the precompiled kernels in cuMM. cuMM ships kernels for specific compute capabilities (e.g., sm_80, sm_86, sm_89, sm_90). Older or newer GPUs may not have a matching kernel.
fix Use a supported GPU (e.g., NVIDIA Ampere, Ada Lovelace, Hopper) or rebuild cuMM from source with the appropriate architecture flags.
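
A quick way to check whether your GPU matches one of the shipped compute capabilities, using PyTorch's device query; the set of architectures below simply mirrors the ones named above and is an assumption, not an exhaustive list:

import torch

# Compute capability of the first CUDA device, e.g. (8, 6) -> sm_86.
major, minor = torch.cuda.get_device_capability(0)
sm = f"sm_{major}{minor}"
print("GPU reports", sm)
# Architectures the precompiled kernels are said to target (assumed set).
if sm not in {"sm_80", "sm_86", "sm_89", "sm_90"}:
    print("No matching precompiled kernel; rebuild cuMM from source for", sm)
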
error ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
cause CUDA runtime library (libcudart.so.12) is not installed or not in the library path.
fix Install the CUDA 12.6 toolkit and add its lib64 directory to LD_LIBRARY_PATH.
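
One way to confirm whether the CUDA 12 runtime can be loaded at all, a sketch using only the Python standard library:

import ctypes
import os

# Try to load the CUDA 12 runtime the way the dynamic loader would.
try:
    ctypes.CDLL("libcudart.so.12")
    print("libcudart.so.12 loaded successfully")
except OSError:
    print("libcudart.so.12 not found on the loader path")
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
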
breaking cuMM requires a compatible CUDA toolkit (CUDA 12.6) and NVIDIA GPU drivers. Running on an unsupported CUDA version may cause import errors or runtime crashes.
fix Ensure your system has CUDA 12.6 installed and set LD_LIBRARY_PATH appropriately.
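
If PyTorch is installed alongside cuMM (as in the example at the end of this page), it can report which CUDA runtime it was built against; a mismatch with 12.6 is a common source of the errors above. A small sketch:

import torch

# CUDA runtime version PyTorch was built with, e.g. '12.6'.
print("torch CUDA version:", torch.version.cuda)
print("CUDA device available:", torch.cuda.is_available())
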
gotcha The library name on PyPI is 'cumm-cu126', but the Python module to import is simply 'cumm'. Do not use the PyPI name in import statements.
fix Use 'import cumm' instead of 'import cumm-cu126'.
deprecated cuMM versions before 0.7.0 used a different API with explicit gemm_ functions. The new API uses cumm.gemm directly.
fix Upgrade to 0.8.2 and replace cumm.gemm_xx with cumm.gemm.
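
A compact migration sketch; the old-style name below is hypothetical, since the exact pre-0.7.0 gemm_ function names are not listed here:

# Pre-0.7.0 (hypothetical name, shown only to illustrate the pattern):
#   z = cumm.gemm_f32(x, y)
# 0.7.0 and later: call cumm.gemm directly, as in the example below.
#   z = cumm.gemm(x, y)
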

Basic GEMM operation using cuMM with PyTorch tensors.

import cumm
import torch

# Two 128x128 single-precision matrices allocated on the GPU.
x = torch.randn(128, 128, device='cuda')
y = torch.randn(128, 128, device='cuda')

# Multiply them with cuMM's GEMM kernel; the result stays on the GPU.
z = cumm.gemm(x, y)
print(z.shape)  # torch.Size([128, 128])
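
Continuing from the example above, the result can be sanity-checked against PyTorch's own matrix multiply; the tolerance is a rough choice and may need loosening for lower-precision inputs:

# Compare cuMM's output with torch.matmul on the same inputs.
ref = torch.matmul(x, y)
print("matches torch.matmul:", torch.allclose(z, ref, atol=1e-4))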