NVIDIA CUBLAS Runtime Libraries for Python (cu11)
The `nvidia-cublas-cu11` package provides the native cuBLAS runtime libraries for NVIDIA GPUs in CUDA 11 environments. cuBLAS is NVIDIA's highly optimized GPU implementation of BLAS (Basic Linear Algebra Subprograms), which is critical for accelerating AI and HPC workloads. The package lets Python environments use the GPU for linear algebra, typically as a dependency of higher-level frameworks such as PyTorch and TensorFlow, which load these shared libraries internally; GPU array libraries such as CuPy also dispatch to cuBLAS, while Numba provides general CUDA kernel programming. At the time of writing, the current version is 11.11.3.6, with releases generally tracking CUDA Toolkit updates and subsequent patch releases.
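To make concrete what cuBLAS accelerates: its workhorse routine is GEMM (general matrix multiply), which computes C = alpha * A @ B + beta * C. A minimal CPU-side sketch of those semantics in NumPy terms (`gemm_reference` is an illustrative helper, not part of any cuBLAS API):

```python
import numpy as np

def gemm_reference(alpha, A, B, beta, C):
    # Same contract as cuBLAS GEMM: C_out = alpha * (A @ B) + beta * C
    return alpha * (A @ B) + beta * C

A = np.arange(4, dtype=np.float32).reshape(2, 2)
B = np.eye(2, dtype=np.float32)   # identity, so A @ B == A
C = np.ones((2, 2), dtype=np.float32)
result = gemm_reference(2.0, A, B, 0.5, C)  # 2*A + 0.5
```

On the GPU, routines like `cublasSgemm` perform exactly this computation, tiled and parallelized across the device.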
Warnings
- breaking Mismatching `nvidia-cublas-cu11` versions with your installed NVIDIA CUDA Toolkit can lead to runtime errors, undefined behavior, or application crashes. CUBLAS versions are tightly coupled with CUDA versions.
- gotcha This package primarily provides native shared libraries (`.so`, `.dll`) and is not intended for direct Python import or interaction. Users typically access CUBLAS functionality indirectly via higher-level Python libraries like Numba, PyTorch, or TensorFlow, which bind to these underlying C/C++ libraries. Attempting `import cublas` will fail.
- gotcha Applications relying on CUBLAS require proper environment variable configuration, especially `LD_LIBRARY_PATH` (on Linux) or system PATH (on Windows), to include the directory containing `libcublas.so` (or `cublas.dll`). Incorrect paths can lead to 'library not found' errors.
- gotcha Memory allocation errors or `CUBLAS_STATUS_INVALID_VALUE` are common when GPU memory is insufficient or if kernel parameters are incorrect. Ensure your GPU has enough memory for the operation.
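To address the library-path gotcha above, one can locate the shared libraries the wheel ships and extend `LD_LIBRARY_PATH`. This is a minimal sketch assuming the Linux wheel layout `site-packages/nvidia/cublas/lib/` (the helper name `find_wheel_cublas_dir` is illustrative); it degrades gracefully when the package is not installed:

```python
import glob
import os
import site

def find_wheel_cublas_dir():
    # Assumed layout of the Linux pip wheel: site-packages/nvidia/cublas/lib/
    bases = list(getattr(site, "getsitepackages", lambda: [])())
    user = getattr(site, "getusersitepackages", lambda: None)()
    if user:
        bases.append(user)
    for base in bases:
        candidate = os.path.join(base, "nvidia", "cublas", "lib")
        if glob.glob(os.path.join(candidate, "libcublas.so*")):
            return candidate
    return None  # wheel not installed (or a non-Linux layout)

lib_dir = find_wheel_cublas_dir()
if lib_dir is not None:
    # Note: modifying os.environ only affects child processes; for the
    # current process, export LD_LIBRARY_PATH before launching Python.
    os.environ["LD_LIBRARY_PATH"] = (
        lib_dir + os.pathsep + os.environ.get("LD_LIBRARY_PATH", "")
    )
```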
Install
- pip install nvidia-cublas-cu11
Imports
- cuBLAS functionality (no direct Python module)
from numba import cuda  # indirect GPU access; cuBLAS itself exposes no importable Python module
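Since there is no Python module to import, the only direct way to touch the library from Python is via `ctypes`. A hedged sketch, assuming the CUDA 11 soname `libcublas.so.11` is on the loader path:

```python
import ctypes

# `import cublas` fails because the package ships only native shared
# libraries. Loading one directly (rarely needed in practice):
try:
    cublas = ctypes.CDLL("libcublas.so.11")
    loaded = True
except OSError:
    loaded = False  # library not found on the dynamic loader path
```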
Quickstart
import numpy as np
from numba import cuda
import math
@cuda.jit
def matmul(A, B, C):
    # Perform matrix multiplication C = A @ B, one thread per output element
    row, col = cuda.grid(2)
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp
# Example usage
N = 256
A_host = np.random.rand(N, N).astype(np.float32)
B_host = np.random.rand(N, N).astype(np.float32)
C_host = np.zeros((N, N), dtype=np.float32)
# Allocate device memory
A_device = cuda.to_device(A_host)
B_device = cuda.to_device(B_host)
C_device = cuda.to_device(C_host)
# Configure the blocks and threads
threads_per_block = (16, 16)
blocks_per_grid_x = int(math.ceil(A_host.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B_host.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
# Launch the kernel
matmul[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
# Copy the result back to the host
C_result = C_device.copy_to_host()
# Note: this kernel is compiled by Numba's CUDA JIT; it does not call cuBLAS.
print('Matrix multiplication completed on GPU via a custom Numba CUDA kernel.')
# Optional verification against a CPU reference:
# C_numpy = A_host @ B_host
# print(f"Max absolute difference: {np.max(np.abs(C_result - C_numpy))}")  # small float32 rounding error expected
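For code that actually exercises the cuBLAS libraries this package provides, a library that dispatches to them is needed. A guarded sketch using CuPy (assumes a CUDA 11 CuPy build and a working GPU; it skips cleanly otherwise):

```python
import numpy as np

shape = None
try:
    import cupy as cp
    a = cp.asarray(np.random.rand(64, 64).astype(np.float32))
    b = cp.asarray(np.random.rand(64, 64).astype(np.float32))
    c = cp.matmul(a, b)  # executed on the GPU by cuBLAS GEMM
    shape = cp.asnumpy(c).shape
except Exception:
    # CuPy missing, or no usable GPU/driver on this machine
    print("CuPy/GPU unavailable; skipping the cuBLAS-backed example.")
```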