Numba CUDA Target
Numba-cuda provides a CUDA target for the Numba Python JIT compiler, enabling Python functions to be compiled and executed on NVIDIA GPUs. It allows users to write custom GPU kernels and device functions directly in a subset of Python. The library, currently at version 0.30.0, is actively developed by NVIDIA, with its release cycle now decoupled from the main Numba project to facilitate more frequent updates and new feature development.
Warnings
- deprecated The built-in CUDA target in the main `numba` package is deprecated. New features and most bug fixes are now exclusively implemented in `numba-cuda`. While the old target remains for compatibility, it's strongly recommended to install `numba-cuda` for active development and to ensure access to the latest capabilities.
- breaking In `numba-cuda` v0.28.0, there was an attempt to move error classes from `numba.core.errors` (e.g. `TypingError`) to the `numba.cuda.errors` namespace. This broke existing code that caught the old error types and was subsequently reverted. Users should be aware that such internal error type changes can be breaking.
- breaking The internal `DeviceArray` implementation underwent refactoring, and certain internal `enums` and `ctypes` code were removed in `numba-cuda` v0.23.0 and v0.28.0 respectively. Code that directly interacted with these internal components or undocumented APIs may break.
- gotcha Numba CUDA kernel functions cannot return values. Any results computed within a kernel must be written to arrays passed as arguments to the kernel. This is a common pattern in CUDA C/C++ and applies to Numba CUDA kernels as well.
- gotcha The first call to a Numba CUDA kernel includes the Just-In-Time (JIT) compilation overhead, which can be significant. For accurate performance benchmarking, always time subsequent calls to the kernel after the initial compilation has completed (e.g., by performing a 'warm-up' run).
- breaking Support for NVIDIA GPUs with compute capability less than 5.0 is deprecated and will be removed in future releases. Additionally, Numba-CUDA requires a minimum CUDA Toolkit version of 11.2.
Install
- pip install numba-cuda
- conda install -c conda-forge numba-cuda
Imports
- cuda
from numba import cuda
Quickstart
import numpy as np
from numba import cuda
# Check for CUDA availability (runtime dependency)
if not cuda.is_available():
    print("CUDA is not available. Please ensure you have an NVIDIA GPU and CUDA drivers installed.")
    raise SystemExit
# Define a CUDA kernel
@cuda.jit
def add_vectors(x, y, out):
    idx = cuda.grid(1)
    if idx < len(out):
        out[idx] = x[idx] + y[idx]
# Host-side code
N = 1000000
x_host = np.arange(N, dtype=np.float32)
y_host = np.arange(N, dtype=np.float32)
out_host = np.empty_like(x_host)
# Allocate memory on the device and copy data
x_device = cuda.to_device(x_host)
y_device = cuda.to_device(y_host)
out_device = cuda.device_array_like(out_host)
# Configure the kernel launch
threadsperblock = 256
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock
# Launch the kernel
add_vectors[blockspergrid, threadsperblock](x_device, y_device, out_device)
# Copy the result back to the host
out_device.copy_to_host(out_host)
# Verify the result
expected_out = x_host + y_host
assert np.allclose(out_host, expected_out)
print("Vector addition on GPU successful!")