CuPy (CUDA 12.x)
CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python on NVIDIA CUDA or AMD ROCm platforms. It provides a `cupy.ndarray` and a rich set of routines whose API is designed as a drop-in replacement for NumPy and SciPy. The current stable release is 14.0.1; the project generally ships stable releases on a roughly bi-monthly cadence, with major versions arriving less frequently.
Warnings
- breaking CuPy v14 updates its type promotion rules and casting behavior to align with NumPy v2 semantics. Code that relied on NumPy v1-specific behaviors under earlier CuPy versions (v13 and prior) may behave differently.
- breaking CuPy v14 drops support for CUDA 11 and Python 3.9. Additionally, all cuDNN-related functionality has been completely removed from CuPy. Users requiring cuDNN should consider external libraries like cuDNN Frontend.
- gotcha CuPy uses a memory pool for GPU allocations. This means GPU memory might not be immediately released back to the system even after arrays go out of scope, which can cause utilities like `nvidia-smi` to report higher memory usage than expected.
- gotcha Frequent data transfers between CPU (host) and GPU (device) are a major performance bottleneck due to PCIe bandwidth limitations. Avoid 'round-tripping' arrays in hot loops.
- gotcha The first time a CuPy kernel is called for specific shapes and data types, it may experience a brief pause for JIT compilation. Subsequent calls with the same parameters will use the cached, pre-compiled kernel.
- gotcha Installing a `cupy-cudaXX` package that does not match your system's CUDA Toolkit version or compatible driver can lead to import failures or runtime errors.
- breaking In CuPy v13+, the default behavior for transferring NumPy arrays backed by pinned memory from CPU to GPU (`cupy.array()`, `cupy.asarray()`) changed from blocking to asynchronous. This can improve performance but may introduce data races if the source array is modified on the CPU before the asynchronous transfer completes.
Install
- pip install cupy-cuda12x
- pip install "cupy-cuda12x[ctk]"
Imports
- cupy
import cupy as cp
- cupyx.scipy
import cupyx.scipy as sp  # submodules (e.g. cupyx.scipy.ndimage) generally must be imported explicitly
Quickstart
import cupy as cp
import numpy as np
# Check if a GPU is available
if cp.cuda.is_available():
    print("GPU is available. Current device:", cp.cuda.Device().id)

    # Create a CuPy array on the GPU
    x_gpu = cp.arange(10, dtype=cp.float32).reshape(2, 5)
    print("CuPy array on GPU:\n", x_gpu)

    # Perform an operation on the GPU
    y_gpu = x_gpu * 2 + 1
    print("Result of GPU operation:\n", y_gpu)

    # Transfer the result back to CPU (NumPy array)
    y_cpu = cp.asnumpy(y_gpu)
    print("Result on CPU (NumPy array):\n", y_cpu)

    # Example: matrix multiplication (dot product of 2-D arrays)
    a_gpu = cp.array([[1, 2], [3, 4]], dtype=cp.float32)
    b_gpu = cp.array([[5, 6], [7, 8]], dtype=cp.float32)
    c_gpu = a_gpu @ b_gpu
    print("\nMatrix multiplication on GPU:\n", c_gpu)

    # For accurate performance timings, ensure GPU operations complete
    cp.cuda.Stream.null.synchronize()
else:
    print("No GPU available. CuPy will not be functional.")
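Beyond the quickstart, a common CuPy idiom is device-agnostic code: `cupy.get_array_module(x)` returns either the `cupy` or `numpy` module depending on where `x` lives, so one function can serve both backends. A sketch with a NumPy fallback when CuPy is not installed (the `softplus` helper here is illustrative, not part of either library):

```python
import numpy as np

try:
    import cupy as cp
    xp_default = cp if cp.cuda.is_available() else np
except ImportError:  # CuPy not installed: stay on the CPU
    cp = None
    xp_default = np

def softplus(x):
    # Dispatch on the array's own module so the same code
    # handles both NumPy and CuPy inputs without branching.
    xp = cp.get_array_module(x) if cp is not None else np
    # Numerically stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^{-|x|})
    return xp.maximum(x, 0) + xp.log1p(xp.exp(-xp.abs(x)))

x = xp_default.linspace(-3, 3, 7, dtype=xp_default.float32)
y = softplus(x)  # stays on whichever device x lives on
```

Note that `cupy.get_array_module` accepts NumPy arrays too (and returns `numpy` for them), which is what makes this pattern safe on mixed inputs.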