NVIDIA CUTLASS Python DSL Base Libraries
NVIDIA CUTLASS Python DSL (`nvidia-cutlass-dsl-libs-base`) provides a Pythonic interface for writing high-performance CUDA kernels using CUTLASS's CuTe library and tensor abstractions. It enables kernel development with automatic compilation to optimized PTX/SASS, offering performance comparable to hand-written CUDA C++ while enhancing developer productivity. Currently at version 4.4.2, the library is actively developed with frequent releases, often tied to new CUDA Toolkit versions and NVIDIA GPU architectures.
Warnings
- breaking The legacy Python API package, previously named `cutlass` (e.g., `import cutlass`), was renamed to `cutlass_cppgen` in CUTLASS 4.2.0 (around September 2025). Direct imports of `cutlass` for the high-level C++ wrappers will fail.
- gotcha CUTLASS Python DSL (including `nvidia-cutlass-dsl-libs-base`) has strict compatibility requirements with specific CUDA Toolkit and NVIDIA driver versions. Mismatches can lead to runtime errors or compilation failures.
- gotcha Unexpected CPU overhead was introduced in version 4.3.4 of the CuTe DSL.
- gotcha Initial releases of CUTLASS DSL 4.0 had limited Python version support (e.g., Python 3.12 only). While newer versions expand this, ensure your Python version is explicitly supported.
- gotcha Versions prior to 4.4.1 could segfault when using `tvm-ffi` on aarch64 systems; upgrade to 4.4.1 or later if you hit this.
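Given the compatibility gotchas above, a quick pre-flight check can save debugging time. This is a minimal sketch, not an official utility: the `min_python` floor and the distribution name `nvidia-cutlass-dsl` are assumptions to adapt to the release notes of your version.

```python
import sys
from importlib import metadata

def check_environment(min_python=(3, 9)):
    """Return the installed nvidia-cutlass-dsl version string, or None if absent."""
    if sys.version_info[:2] < min_python:
        raise RuntimeError(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            "may not be supported by this CUTLASS DSL release"
        )
    try:
        return metadata.version("nvidia-cutlass-dsl")
    except metadata.PackageNotFoundError:
        return None  # DSL not installed in this environment

print(check_environment())
```

Pair this with `nvidia-smi` output to confirm the driver matches your CUDA Toolkit before compiling kernels.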
Install
- pip install nvidia-cutlass-dsl
- pip install nvidia-cutlass-dsl-libs-base
- pip install nvidia-cutlass-dsl[cu13]
Imports
- cute
import cutlass.cute as cute
- from_dlpack
from cutlass.cute.runtime import from_dlpack
- cutlass
import cutlass_cppgen as cutlass
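Because of the 4.2.0 rename noted in the Warnings section, code that supports multiple CUTLASS versions may need to handle both package names. A hedged sketch for detecting which name is present, without importing either (the helper name is illustrative, not part of the library):

```python
from importlib import util

def cppgen_module_name():
    """Return the importable name of the high-level C++ wrappers, or None.

    `cutlass_cppgen` is the name from CUTLASS 4.2.0 onward. Note that a
    findable `cutlass` module is ambiguous: before 4.2.0 it held the legacy
    wrappers, but on current releases it hosts the CuTe DSL instead.
    """
    for name in ("cutlass_cppgen", "cutlass"):
        if util.find_spec(name) is not None:
            return name
    return None  # neither package installed
```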
Quickstart
import cutlass.cute as cute
import torch
from cutlass.cute.runtime import from_dlpack
@cute.kernel
def elementwise_add_kernel(
    gA: cute.Tensor,
    gB: cute.Tensor,
    gC: cute.Tensor,
):
    # Compute a global linear thread index from block and thread coordinates
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    thread_idx = bidx * bdim + tidx
    # Map the linear index to a 2D coordinate of the row-major tensor.
    # Real kernels would use CuTe's layout algebra to tile and partition.
    m, n = gA.shape
    mi = thread_idx // n
    ni = thread_idx % n
    # Bounds check, then element-wise addition
    if mi < m:
        gC[mi, ni] = gA[mi, ni] + gB[mi, ni]

@cute.jit
def elementwise_add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # Host-side launcher: a simple 1D block/grid configuration
    block_size = 256  # threads per block
    m, n = mA.shape
    grid_size = (m * n + block_size - 1) // block_size  # enough blocks for all elements
    elementwise_add_kernel(mA, mB, mC).launch(
        grid=[grid_size, 1, 1], block=[block_size, 1, 1]
    )

M, N = 1024, 512
A_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')
B_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')
C_torch = torch.zeros(M, N, dtype=torch.float32, device='cuda')
# Convert torch tensors to CuTe tensors
mA = from_dlpack(A_torch).mark_layout_dynamic()
mB = from_dlpack(B_torch).mark_layout_dynamic()
mC = from_dlpack(C_torch).mark_layout_dynamic()
# Compile the host function (and the kernel it launches), then run it
compiled_add = cute.compile(elementwise_add, mA, mB, mC)
compiled_add(mA, mB, mC)
# Verify against a PyTorch reference computation (optional)
try:
torch.testing.assert_close(C_torch, A_torch + B_torch)
print("Kernel executed successfully and results match!")
except AssertionError as e:
print(f"Verification failed: {e}")
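Running the quickstart requires a CUDA GPU, but the row-major linear-index-to-coordinate arithmetic that a 1D-launched elementwise kernel relies on can be sanity-checked on the CPU. A minimal, framework-free sketch:

```python
def linear_to_coord(thread_idx, n):
    # Split a linear thread index into a (row, col) coordinate of an
    # M x N row-major tensor -- the same arithmetic a 1D grid of threads
    # uses to cover a 2D tensor element-wise.
    return thread_idx // n, thread_idx % n

M, N = 4, 3
coords = [linear_to_coord(i, N) for i in range(M * N)]
# Every element of the 4x3 tensor is visited exactly once
assert sorted(coords) == [(mi, ni) for mi in range(M) for ni in range(N)]
```

The same check scales to any (M, N); only the bounds guard matters when M * N is not a multiple of the block size.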