NVIDIA CUTLASS Python DSL

4.4.2 · active · verified Fri Apr 10

NVIDIA CUTLASS Python DSL (version 4.4.2) is a Python-based domain-specific language (DSL) for writing high-performance CUDA kernels. It provides a Pythonic interface to CUTLASS's CuTe library, enabling kernel development with automatic JIT compilation to optimized PTX/SASS for NVIDIA GPUs (Ampere, Hopper, Blackwell architectures). It aims for zero-cost abstraction, performance comparable to C++ kernels, and seamless integration with deep learning frameworks like PyTorch and JAX. The library maintains an active development pace with frequent updates and minor version releases.

Warnings

Install
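The DSL ships as a standalone pip package. A minimal install sketch, assuming the package name `nvidia-cutlass-dsl` from the CUTLASS 4.x release line (a CUDA-capable GPU and a recent driver are required at runtime):

```shell
# Install the CUTLASS Python DSL (package name assumed from the 4.x releases)
pip install nvidia-cutlass-dsl

# PyTorch is only needed for the DLPack interop shown in the quickstart
pip install torch
```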

Imports
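The imports used throughout the quickstart below, as they appear in the CUTLASS example code (the `from_dlpack` path is the documented location of the DLPack converter):

```python
import cutlass                              # top-level package
import cutlass.cute as cute                 # CuTe DSL: kernels, layouts, launch
from cutlass.cute.runtime import from_dlpack  # host-side tensor conversion
```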

Quickstart

This quickstart demonstrates a simple element-wise addition kernel written using CuTe DSL. It defines a GPU kernel with `@cute.kernel` and a host-side launch function with `@cute.jit`. It also shows how to interoperate with PyTorch tensors using `cute.runtime.from_dlpack` to pass data to the JIT-compiled kernel. The example performs vector addition on CUDA, launches the kernel, and verifies the output against PyTorch's native operation.

import cutlass.cute as cute
import torch

@cute.kernel
def elementwise_add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # Get thread index (tidx) and block index (bidx)
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    
    # Compute the global element index (simple 1D mapping for demonstration;
    # a real kernel would use CuTe layout algebra instead)
    bdimx, _, _ = cute.arch.block_dim()
    global_idx = bidx * bdimx + tidx
    
    # Perform element-wise addition
    if global_idx < cute.size(gC):
        gC[global_idx] = gA[global_idx] + gB[global_idx]

@cute.jit
def launch_add_kernel(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # Compute the launch configuration on the host side of the JIT function
    num_elements = cute.size(mA)
    threads_per_block = 256  # example thread block size
    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block

    elementwise_add_kernel(mA, mB, mC).launch(
        grid=[blocks_per_grid, 1, 1],
        block=[threads_per_block, 1, 1],
    )

if __name__ == '__main__':
    # Create example PyTorch tensors on the GPU
    size = 1024 * 1024  # 1 million elements
    A_torch = torch.randn(size, dtype=torch.float32, device='cuda')
    B_torch = torch.randn(size, dtype=torch.float32, device='cuda')
    C_torch = torch.empty_like(A_torch)

    # Convert the PyTorch tensors to CuTe tensors via DLPack (zero-copy)
    # at the call boundary, then launch the JIT-compiled kernel
    launch_add_kernel(
        cute.runtime.from_dlpack(A_torch),
        cute.runtime.from_dlpack(B_torch),
        cute.runtime.from_dlpack(C_torch),
    )

    # Verify results (optional, using torch for comparison)
    C_expected = A_torch + B_torch
    assert torch.allclose(C_torch, C_expected, atol=1e-5), "Results do not match!"
    print("Kernel executed successfully and results verified.")
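The grid-size arithmetic in `launch_add_kernel` is ordinary ceiling division: one block per full group of `threads_per_block` elements, plus one partial block for any remainder. A standalone sketch (pure Python, no GPU needed; the helper name is illustrative, not part of the DSL):

```python
def blocks_for(num_elements: int, threads_per_block: int = 256) -> int:
    """Ceiling division: the smallest grid that covers every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block

# 1M elements divide evenly into 256-thread blocks
print(blocks_for(1024 * 1024))  # 4096
# 1000 elements need 3 full blocks plus 1 partial block
print(blocks_for(1000))         # 4
```

The in-kernel bounds check (`global_idx < cute.size(gC)`) exists precisely because that last partial block launches threads past the end of the data.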
