NVIDIA cuFFT for CUDA 12
nvidia-cufft-cu12 provides the native runtime libraries for NVIDIA's CUDA Fast Fourier Transform (cuFFT) product, a GPU-accelerated library for performing FFT calculations. It is a fundamental component for various scientific and engineering applications, including deep learning, computer vision, and computational physics. The library is actively maintained by the Nvidia CUDA Installer Team and receives frequent updates; the current version is 11.4.1.4, released on June 5, 2025. It primarily serves as a low-level dependency for higher-level Python frameworks and libraries that leverage GPU-accelerated FFTs.
Warnings
- breaking Deprecated GPU architectures: From CUDA 12.0 onwards, GPU architectures SM35 and SM37 are no longer supported. The minimum required architecture is SM50. Older CUDA versions (e.g., 11.0) also deprecated earlier architectures like SM30.
- deprecated Legacy cuFFT callback functionality: Support for callback routines using separately compiled device code (legacy callbacks) has been deprecated since CUDA 11.4. As of CUDA 11.8, CUDA Graphs capture is no longer supported for legacy callback routines that load data in out-of-place transforms.
- gotcha Performance degradation with legacy callbacks: Users have reported significant slowdowns (up to 20% or more) when using legacy cuFFT callbacks in CUDA 11.8 and newer (e.g., 12.2, 12.4, 12.9+) compared to CUDA 11.7. This often manifests as increased time spent in `cuMemFree_v2` during `cufftExecC2R` or `cufftExecR2C` calls.
- gotcha Memory leak with `nvc++ -cudalib=cufft`: A potential memory leak was reported in cuFFT v10.9.0.58 (shipped with CUDA 11.8) when used with `nvc++` and the `-cudalib=cufft` flag. It was linked to cuFFT failing to deallocate internal structures when the CUDA context active at program finalization differed from the one used for plan creation.
- gotcha Interference of `cudaDeviceReset()` with `cufftPlanMany`: Calling `cudaDeviceReset()` before `cufftPlanMany` can lead to `CUFFT_INTERNAL_ERROR`. While adding `cudaSetDevice(0)` after the reset might mitigate it, `cudaDeviceReset()` is generally not recommended for regular use.
- gotcha `CUFFT_INTERNAL_ERROR` in `cufftXtSetGPU` for multi-GPU FFTs: When performing large multi-GPU FFTs, `cufftXtSetGPU` can return an opaque 'internal error,' potentially indicating an out-of-memory condition or an unspecified library issue.
- gotcha Installation timeouts/failures with concurrent downloads from `pypi.nvidia.com`: Users attempting to install `nvidia-cufft-cu12` (and other NVIDIA PyPI packages) with tools that use concurrent downloads (e.g., `uv`) may experience failures due to timeout or network issues with `pypi.nvidia.com`.
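The SM50 minimum from the first warning can be checked before any cuFFT-backed work is dispatched. The sketch below is a hypothetical helper, not part of cuFFT or any NVIDIA package: the names `parse_compute_capability` and `is_supported` are illustrative. It assumes the compute capability arrives either as a dotted string such as `8.6` or as the undotted digit string that CuPy reports (for example `86`).

```python
# Hypothetical helper: compare a GPU's compute capability against the
# SM50 minimum stated in the warnings. Not part of cuFFT itself.
MIN_COMPUTE_CAPABILITY = (5, 0)


def parse_compute_capability(value):
    """Parse '8.6' or a CuPy-style '86' into a (major, minor) tuple."""
    text = str(value).strip()
    if "." in text:
        major, minor = text.split(".", 1)
    else:
        # Undotted form: the last digit is the minor version.
        major, minor = text[:-1], text[-1]
    return int(major), int(minor)


def is_supported(value, minimum=MIN_COMPUTE_CAPABILITY):
    """Return True if the capability meets the minimum architecture."""
    return parse_compute_capability(value) >= minimum
```

With CuPy installed, a call such as `is_supported(cp.cuda.Device(0).compute_capability)` would apply the check to the first visible GPU.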
Install
- pip install nvidia-cufft-cu12
- pip install nvmath-python[cu12]
Imports
- cuFFT functionality
This package is a low-level runtime library and is not directly imported in Python. High-level libraries like `nvmath-python`, `torch`, or `tensorflow` provide Python interfaces that utilize `nvidia-cufft-cu12` under the hood.
- nvmath.fft
from nvmath.fft import fft, ifft
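Because the package exposes no Python API of its own, one way to confirm that the runtime is present is to inspect the files the wheel installed. The following is a minimal sketch using only the standard library. The function name `find_cufft_libraries` is hypothetical, and the filename filter (anything containing `cufft` with a `.so` or `.dll` suffix) is an assumption about how the shared libraries are named.

```python
from importlib import metadata


def find_cufft_libraries(dist_name="nvidia-cufft-cu12"):
    """Return paths of cuFFT shared libraries shipped by the wheel, or []."""
    try:
        files = metadata.files(dist_name) or []
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    return [
        str(f.locate())
        for f in files
        # Assumed naming: e.g. libcufft.so.<major> on Linux, cufft*.dll on Windows.
        if "cufft" in f.name.lower() and (".so" in f.name or f.name.endswith(".dll"))
    ]


if __name__ == "__main__":
    libs = find_cufft_libraries()
    print("\n".join(libs) if libs else "nvidia-cufft-cu12 is not installed")
```

An empty result means either that the wheel is missing or that the filter did not match. It does not by itself prove that a higher-level framework cannot find cuFFT through a system CUDA Toolkit.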
Quickstart
import cupy as cp
import nvmath.fft as nvfft
# Ensure CUDA is available and nvmath-python is correctly set up
# (e.g., pip install nvmath-python[cu12] and appropriate CUDA Toolkit installation)
# Example: Perform a 1D complex-to-complex FFT using nvmath-python
size = 1024
x = cp.arange(size, dtype=cp.complex64)
# Perform forward FFT
y = nvfft.fft(x)
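# Perform inverse FFT. cuFFT transforms are unnormalized, so a forward plus
# inverse round trip returns the input scaled by `size`; this assumes
# nvmath-python follows the same convention, hence the division.
z = nvfft.ifft(y) / size
print(f"Original data (first 5 elements): {x[:5].tolist()}")
print(f"FFT result (first 5 elements): {y[:5].tolist()}")
print(f"Inverse FFT result (first 5 elements): {z[:5].tolist()}")
print(f"Difference from original (max abs error): {float(cp.max(cp.abs(x - z)))}")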
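cuFFT computes unnormalized transforms, so a forward transform followed by an inverse one returns the input multiplied by the number of elements. The pure-Python sketch below illustrates that convention with a naive DFT. It is not cuFFT or nvmath-python code, and the `dft` helper is hypothetical and far too slow for real use. It only shows why a round trip needs a division by the transform length.

```python
import cmath


def dft(values, sign):
    """Naive unnormalized DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
            for j, v in enumerate(values))
        for k in range(n)
    ]


x = [complex(i, 0) for i in range(8)]

# Forward then inverse, with no scaling applied at either step.
round_trip = dft(dft(x, -1), +1)

# Dividing by the length recovers the original samples.
scaled_back = [v / len(x) for v in round_trip]
max_err = max(abs(a - b) for a, b in zip(x, scaled_back))
print(f"max abs error after rescaling: {max_err:.2e}")
```

The same reasoning applies to multi-dimensional transforms, where the scale factor is the product of the transformed axis lengths.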