NVIDIA CUTLASS Python DSL Base Libraries

4.4.2 · active · verified Sat Apr 11

NVIDIA CUTLASS Python DSL (`nvidia-cutlass-dsl-libs-base`) provides a Pythonic interface for writing high-performance CUDA kernels using CUTLASS's CuTe library and tensor abstractions. It enables kernel development with automatic compilation to optimized PTX/SASS, offering performance comparable to hand-written CUDA C++ while enhancing developer productivity. Currently at version 4.4.2, the library is actively developed with frequent releases, often tied to new CUDA Toolkit versions and NVIDIA GPU architectures.

Warnings

Install
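The library ships as a pip package. Assuming the package name given above (`nvidia-cutlass-dsl-libs-base`) is the published PyPI name, installation would look like:

```shell
# Install the CUTLASS Python DSL base libraries
# (package name taken from this page's header; an NVIDIA GPU and a
# recent CUDA driver are required at runtime)
pip install nvidia-cutlass-dsl-libs-base
```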

Imports

Quickstart

This quickstart demonstrates how to define a simple element-wise addition CUDA kernel using the CuTe DSL. It shows the `@cute.kernel` decorator, `cute.Tensor` kernel arguments, reading thread and block indices via `cute.arch`, converting PyTorch tensors with `from_dlpack`, compiling with `cute.compile`, and launching the kernel on the GPU.

import cutlass.cute as cute
import torch
from cutlass.cute.runtime import from_dlpack

@cute.kernel
def elementwise_add_kernel(
    gA: cute.Tensor,
    gB: cute.Tensor,
    gC: cute.Tensor,
):
    # Compute a global thread index: block index * block size + thread index.
    # Using only the thread index would make every block redundantly
    # process the same first few elements.
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    thread_idx = bidx * bdim + tidx

    # Map the linear thread index to a 2D (row, col) coordinate.
    # This simple div/mod mapping is for illustration; real kernels
    # would use CuTe layouts, tiling, and partitioning instead.
    m, n = gA.shape
    mi = thread_idx // n
    ni = thread_idx % n

    # Bounds check: the grid may launch more threads than elements
    if thread_idx < m * n:
        gC[mi, ni] = gA[mi, ni] + gB[mi, ni]

@cute.jit
def elementwise_add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # A simple one-thread-per-element launch configuration.
    # More complex kernels would use CuTe's layout algebra.
    num_threads_per_block = 256
    m, n = mA.shape
    num_blocks = (m * n + num_threads_per_block - 1) // num_threads_per_block
    elementwise_add_kernel(mA, mB, mC).launch(
        grid=[num_blocks, 1, 1],
        block=[num_threads_per_block, 1, 1],
    )

M, N = 1024, 512
A_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')
B_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')
C_torch = torch.zeros(M, N, dtype=torch.float32, device='cuda')

# Convert torch tensors to CuTe tensors via the DLPack protocol
mA = from_dlpack(A_torch).mark_layout_dynamic()
mB = from_dlpack(B_torch).mark_layout_dynamic()
mC = from_dlpack(C_torch).mark_layout_dynamic()

# Compile the host function (and the kernel it launches) once...
compiled_fn = cute.compile(elementwise_add, mA, mB, mC)

# ...then invoke it to launch the kernel on the GPU
compiled_fn(mA, mB, mC)

# Verify the result against PyTorch
try:
    torch.testing.assert_close(C_torch, A_torch + B_torch)
    print("Kernel executed successfully and results match!")
except AssertionError as e:
    print(f"Verification failed: {e}")
