NVIDIA CUBLAS Runtime Libraries for Python (cu11)

11.11.3.6 · active · verified Fri Apr 10

The `nvidia-cublas-cu11` package provides the native CUBLAS runtime libraries for NVIDIA GPUs in CUDA 11 environments. CUBLAS is NVIDIA's highly optimized implementation of BLAS (Basic Linear Algebra Subprograms), which is critical for accelerating AI and HPC workloads. The package gives Python environments access to GPU computational resources for linear algebra, typically as a dependency of higher-level frameworks such as PyTorch and TensorFlow, or of GPU wrappers such as Numba. The current version is 11.11.3.6; releases generally track CUDA Toolkit updates and subsequent patch releases.

Warnings

Install
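The package is published on PyPI under the name given above; a typical installation for a CUDA 11 environment looks like the following.

```shell
# Install the CUDA 11 CUBLAS runtime wheel from PyPI
pip install nvidia-cublas-cu11

# Or pin the documented release for reproducible builds
pip install nvidia-cublas-cu11==11.11.3.6
```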

Imports
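This package ships native shared libraries rather than an importable Python module, so there is nothing to `import` directly. As a sketch, assuming the wheel's conventional layout (`site-packages/nvidia/cublas/lib/`), you can locate the installed libraries like this:

```python
import glob
import os
import sysconfig


def find_cublas_libs():
    """Return paths to the CUBLAS shared libraries shipped by nvidia-cublas-cu11.

    Assumes the wheel's conventional layout (site-packages/nvidia/cublas/lib/);
    returns an empty list if the package is not installed.
    """
    purelib = sysconfig.get_paths()["purelib"]
    pattern = os.path.join(purelib, "nvidia", "cublas", "lib", "libcublas*")
    return sorted(glob.glob(pattern))


print(find_cublas_libs())
```

Frameworks that depend on this package load these libraries themselves; the snippet is only for checking what got installed.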

Quickstart

This quickstart demonstrates GPU-accelerated linear algebra through Numba, running in the same CUDA 11 environment that `nvidia-cublas-cu11` targets. It performs a basic matrix multiplication, highlighting the necessary steps for device memory allocation, kernel launch configuration, and result retrieval. Ensure you have Numba installed (`pip install numba`), a compatible NVIDIA GPU, and a CUDA 11 driver/toolkit. Note that this example uses a hand-written kernel rather than calling CUBLAS itself; the CUBLAS libraries from this package are what higher-level frameworks load for their BLAS routines.

import numpy as np
from numba import cuda
import math

@cuda.jit
def matmul(A, B, C):
    # Perform matrix multiplication of C = A * B
    row, col = cuda.grid(2)
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Example usage
N = 256
A_host = np.random.rand(N, N).astype(np.float32)
B_host = np.random.rand(N, N).astype(np.float32)
C_host = np.zeros((N, N), dtype=np.float32)

# Allocate device memory
A_device = cuda.to_device(A_host)
B_device = cuda.to_device(B_host)
C_device = cuda.to_device(C_host)

# Configure the blocks and threads
threads_per_block = (16, 16)
blocks_per_grid_x = int(math.ceil(A_host.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B_host.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

# Launch the kernel
matmul[blocks_per_grid, threads_per_block](A_device, B_device, C_device)

# Copy the result back to the host
C_result = C_device.copy_to_host()

print('Matrix multiplication completed on GPU with a custom Numba kernel.')
# Verify against a CPU reference computed with NumPy
C_numpy = np.dot(A_host, B_host)
print(f"Max absolute difference: {np.max(np.abs(C_result - C_numpy))}")  # small, limited by float32 rounding
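To exercise the CUBLAS libraries from this package directly, one option is CuPy (installable as `cupy-cuda11x`), whose matrix multiplication dispatches to CUBLAS GEMM routines. A minimal sketch, written defensively so it degrades gracefully when CuPy or a CUDA device is unavailable:

```python
def gemm_demo(n=256):
    """Run an n x n float32 GEMM via CuPy, which calls CUBLAS under the hood.

    Returns the result shape, or None if CuPy or a CUDA device is unavailable.
    """
    try:
        import cupy as cp
        a = cp.random.rand(n, n, dtype=cp.float32)
        b = cp.random.rand(n, n, dtype=cp.float32)
        return tuple(cp.matmul(a, b).shape)  # cuBLAS sgemm under the hood
    except Exception:
        return None


print(gemm_demo())  # (256, 256) on a working CUDA 11 setup; None otherwise
```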
