NVIDIA CUTLASS Python DSL
NVIDIA CUTLASS Python DSL (version 4.4.2) is a Python-based domain-specific language (DSL) for writing high-performance CUDA kernels. It provides a Pythonic interface to CUTLASS's CuTe library, enabling kernel development with automatic JIT compilation to optimized PTX/SASS for NVIDIA GPUs (Ampere, Hopper, Blackwell architectures). It aims for zero-cost abstraction, performance comparable to C++ kernels, and seamless integration with deep learning frameworks like PyTorch and JAX. The library maintains an active development pace with frequent updates and minor version releases.
Warnings
- breaking NVIDIA CUTLASS Python DSL (CuTe DSL) is a distinct project from the older 'CUTLASS Python' (which was a Python interface for C++ kernels). Existing code relying on the older interface will not be compatible.
- gotcha The DSL requires a specific NVIDIA CUDA Toolkit version. For example, version 4.4.2 supports Python 3.10-3.14 and requires CUDA Toolkit 12.0+ (with 13.1 recommended for latest features like GB300 and Hopper FMHA fixes). Incompatible toolkit versions can lead to performance regressions, compilation errors, or runtime issues.
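A quick host-side preflight check can catch the interpreter half of this before anything is imported. This is a minimal sketch: the version bounds are taken from the note above and apply to DSL 4.4.2 specifically, so adjust them for the release you install; `python_supported` is a hypothetical helper, not part of the DSL.

```python
import sys

# Supported interpreter range for CuTe DSL 4.4.2 (per the warning above)
MIN_PY, MAX_PY = (3, 10), (3, 14)

def python_supported(version=sys.version_info[:2]):
    """Return True if the given (major, minor) falls in the supported range."""
    return MIN_PY <= tuple(version) <= MAX_PY

print("Python", ".".join(map(str, sys.version_info[:2])),
      "supported:", python_supported())
```

Checking the CUDA Toolkit side (12.0+, 13.1 recommended) still has to be done against your local `nvcc`/driver installation.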
- gotcha CuTe DSL has design limitations regarding Python language semantics within JIT-compiled functions. Complex data structures like lists, tuples, or dictionaries passed as dynamic values are treated as static containers and cannot be modified at runtime inside kernels. Returning dynamic values from kernels is also currently limited.
- gotcha Optional features like Apache TVM FFI, which improves PyTorch interoperability and reduces host overhead, require separate installation (`pip install apache-tvm-ffi torch-c-dlpack-ext`) and explicit enabling (e.g., via `enable_tvm_ffi=True` in `cute.runtime.from_dlpack` or by setting `CUTE_DSL_ENABLE_TVM_FFI=1` environment variable).
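If you take the environment-variable route, a minimal sketch looks like the following (the variable name and keyword argument are taken from the note above; as with most environment-driven toggles, it typically must be set before the DSL is first imported):

```python
import os

# Enable the optional Apache TVM FFI path globally. Set this before the
# first `import cutlass.cute` so the DSL can pick it up at init time.
# Assumes the extras are installed: apache-tvm-ffi, torch-c-dlpack-ext.
os.environ["CUTE_DSL_ENABLE_TVM_FFI"] = "1"

# Alternative, per-tensor opt-in (per the note above):
#   cute.runtime.from_dlpack(t, enable_tvm_ffi=True)
```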
- breaking API changes in `cutlass.cute.arch` functions (e.g., `fence_proxy`, `warp_redux_sync`, `atomic_add`, `load`, `store`) in CUDA Toolkit 13.1+ environments now require string literals instead of enum arguments.
Install
- pip install nvidia-cutlass-dsl
- pip install nvidia-cutlass-dsl[cu13]
Imports
- cute
import cutlass.cute as cute
- kernel
from cutlass.cute import kernel
- jit
from cutlass.cute import jit
- from_dlpack
from cutlass.cute.runtime import from_dlpack
Quickstart
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
import torch

@cute.kernel
def elementwise_add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # Get thread index (tidx), block index (bidx), and block dimension (bdim)
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    # Compute the global element index (simple 1D mapping for demonstration;
    # a real kernel would use CuTe layout algebra instead)
    global_idx = bidx * bdim + tidx
    # Guard against out-of-bounds threads, then add element-wise
    if global_idx < cute.size(gC):
        gC[global_idx] = gA[global_idx] + gB[global_idx]

@cute.jit
def launch_add_kernel(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor):
    # Compute the launch configuration (ceiling division) and launch the kernel
    num_elements = cute.size(A)
    threads_per_block = 256  # example thread block size
    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block
    elementwise_add_kernel(A, B, C).launch(
        grid=[blocks_per_grid, 1, 1],
        block=[threads_per_block, 1, 1],
    )

if __name__ == '__main__':
    # Create example PyTorch tensors on GPU
    size = 1024 * 1024  # 1 million elements
    A_torch = torch.randn(size, dtype=torch.float32, device='cuda')
    B_torch = torch.randn(size, dtype=torch.float32, device='cuda')
    C_torch = torch.empty_like(A_torch)
    # Wrap the PyTorch tensors as CuTe tensors via DLPack, then launch
    launch_add_kernel(from_dlpack(A_torch), from_dlpack(B_torch), from_dlpack(C_torch))
    # Verify results (optional, using torch for comparison)
    C_expected = A_torch + B_torch
    assert torch.allclose(C_torch, C_expected, atol=1e-5), "Results do not match!"
    print("Kernel executed successfully and results verified.")