NVIDIA cuDNN Frontend
`nvidia-cudnn-frontend` is a Python library that provides a high-level, user-friendly API over the cuDNN deep learning library's backend. It builds and executes graphs of optimized tensor operations, including fused operation patterns, tailored to NVIDIA GPUs. It is currently at version 1.22.1 and maintains an active release cadence, often tracking new cuDNN backend releases.
Warnings
- breaking The `nvidia-cudnn-frontend` Python package requires a compatible installation of the NVIDIA CUDA Toolkit and the cuDNN native C++ library on the system. Installing the Python package alone is insufficient for functionality.
- gotcha Specific `nvidia-cudnn-frontend` versions are often recommended for particular `cuDNN` backend and `CUDA Toolkit` versions. Mismatched versions can lead to runtime errors, performance issues, or inability to leverage new features.
- breaking In version 1.18.0, the library internally transitioned away from using the older `v0.x API` and now directly calls the cuDNN backend API. This change might break compatibility for users who were relying on or interacting with internal `v0.x` API constructs.
- gotcha Version 1.19.1 was released to address issues with `pybind11` versions and restore `cuda-12` toolkit support accidentally dropped in 1.19.0. Older or incompatible `pybind11` installations can cause installation or runtime failures.
Install
pip install nvidia-cudnn-frontend
Imports
- cudnn
import cudnn
Quickstart
import cudnn
import torch

# A CUDA-capable GPU and the cuDNN backend library must be available.
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. This library requires a CUDA-enabled GPU.")

# Example: create and execute a simple convolution graph.
# Define input and weight tensors on the CUDA device.
x = torch.randn(1, 1, 28, 28, device="cuda", dtype=torch.float32)
w = torch.randn(16, 1, 3, 3, device="cuda", dtype=torch.float32)

# Create a cuDNN frontend graph.
graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.FLOAT,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Declare graph input tensors with the shapes and strides of the PyTorch tensors.
X = graph.tensor(name="X", dim=x.size(), stride=x.stride(), data_type=cudnn.data_type.FLOAT)
W = graph.tensor(name="W", dim=w.size(), stride=w.stride(), data_type=cudnn.data_type.FLOAT)

# Define a forward convolution and mark its result as a graph output.
Y = graph.conv_fprop(image=X, weight=W, padding=[1, 1], stride=[1, 1], dilation=[1, 1])
Y.set_output(True)

# Validate the graph, lower it to a backend operation graph, select execution
# plans via heuristics (with a fallback), verify support, and build the plans.
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()

# Allocate the output tensor and scratch workspace, then execute the graph.
y_out = torch.empty(Y.get_dim(), device="cuda", dtype=torch.float32)
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({X: x, W: w, Y: y_out}, workspace)
torch.cuda.synchronize()
print("Graph execution successful!")
print(f"Output tensor shape: {y_out.shape}")
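The output shape printed above can be predicted from standard convolution arithmetic; `conv_out_dim` below is an illustrative helper, not part of the cuDNN API:

```python
def conv_out_dim(in_dim: int, kernel: int, pad: int, stride: int, dilation: int) -> int:
    # Standard convolution output size:
    # floor((in + 2*pad - dilation*(kernel - 1) - 1) / stride) + 1
    return (in_dim + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

# 28x28 input, 3x3 kernel, padding 1, stride 1, dilation 1 -> 28x28 output,
# so with batch 1 and 16 output channels the full shape is (1, 16, 28, 28).
print(conv_out_dim(28, 3, 1, 1, 1))  # -> 28
```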