{"id":2620,"library":"nvidia-cutlass-dsl","title":"NVIDIA CUTLASS Python DSL","description":"NVIDIA CUTLASS Python DSL (version 4.4.2) is a Python-based domain-specific language (DSL) for writing high-performance CUDA kernels. It provides a Pythonic interface to CUTLASS's CuTe library, enabling kernel development with automatic JIT compilation to optimized PTX/SASS for NVIDIA GPUs (Ampere, Hopper, Blackwell architectures). It aims for zero-cost abstraction, performance comparable to C++ kernels, and seamless integration with deep learning frameworks like PyTorch and JAX. The library maintains an active development pace with frequent updates and minor version releases.","status":"active","version":"4.4.2","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/cutlass","tags":["cuda","gpu","deep learning","high performance","compiler","dsl","nvidia","kernel development"],"install":[{"cmd":"pip install nvidia-cutlass-dsl","lang":"bash","label":"Base Installation"},{"cmd":"pip install nvidia-cutlass-dsl[cu13]","lang":"bash","label":"For CUDA Toolkit 13.1+"}],"dependencies":[{"reason":"Recommended for running examples and PyTorch interoperability.","package":"torch","optional":true},{"reason":"Recommended for educational notebooks and development.","package":"jupyter","optional":true},{"reason":"Recommended for JAX integration and examples (specific versions recommended).","package":"jax[cuda]","optional":true},{"reason":"Optional for improved PyTorch interop and faster JIT function invocation.","package":"apache-tvm-ffi","optional":true},{"reason":"Optional, often installed alongside tvm-ffi for DLPack protocol integration.","package":"torch-c-dlpack-ext","optional":true}],"imports":[{"symbol":"cute","correct":"import cutlass.cute as cute"},{"note":"Decorator for defining GPU kernel functions.","symbol":"kernel","correct":"from cutlass.cute import kernel"},{"note":"Decorator for defining host-side JIT-compiled 
functions.","symbol":"jit","correct":"from cutlass.cute import jit"},{"note":"For converting framework tensors (e.g., PyTorch) to CuTe tensors.","symbol":"from_dlpack","correct":"from cutlass.cute.runtime import from_dlpack"}],"quickstart":{"code":"import cutlass.cute as cute\nfrom cutlass.cute.runtime import from_dlpack\nimport torch\n\n@cute.kernel\ndef elementwise_add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):\n    # Get thread index (tidx), block index (bidx), and block dimension (bdim)\n    tidx, _, _ = cute.arch.thread_idx()\n    bidx, _, _ = cute.arch.block_idx()\n    bdim, _, _ = cute.arch.block_dim()\n\n    # Compute a global index (simple 1D mapping for demonstration);\n    # real kernels typically use CuTe layout algebra instead\n    global_idx = bidx * bdim + tidx\n\n    # Perform element-wise addition with a bounds check\n    if global_idx < cute.size(gC):\n        gC[global_idx] = gA[global_idx] + gB[global_idx]\n\n@cute.jit\ndef launch_add_kernel(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):\n    # Compute the launch configuration\n    num_elements = cute.size(mA)\n    threads_per_block = 256  # Example thread block size\n    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block\n\n    elementwise_add_kernel(mA, mB, mC).launch(\n        grid=[blocks_per_grid, 1, 1],\n        block=[threads_per_block, 1, 1]\n    )\n\nif __name__ == '__main__':\n    # Create example PyTorch tensors on GPU\n    size = 1024 * 1024  # 1 million elements\n    A_torch = torch.randn(size, dtype=torch.float32, device='cuda')\n    B_torch = torch.randn(size, dtype=torch.float32, device='cuda')\n    C_torch = torch.empty_like(A_torch)\n\n    # Convert framework tensors to CuTe tensors via DLPack, then launch\n    launch_add_kernel(from_dlpack(A_torch), from_dlpack(B_torch), from_dlpack(C_torch))\n\n    # Verify results against PyTorch's native operation\n    torch.cuda.synchronize()\n    C_expected = A_torch + B_torch\n    assert torch.allclose(C_torch, C_expected, atol=1e-5), \"Results do not match!\"\n    print(\"Kernel executed successfully and results 
verified.\")\n","lang":"python","description":"This quickstart demonstrates a simple element-wise addition kernel written using CuTe DSL. It defines a GPU kernel with `@cute.kernel` and a host-side launch function with `@cute.jit`. It also shows how to interoperate with PyTorch tensors using `cute.runtime.from_dlpack` to pass data to the JIT-compiled kernel. The example performs vector addition on CUDA, launches the kernel, and verifies the output against PyTorch's native operation."},"warnings":[{"fix":"Rewrite kernels and host interaction using the CuTe DSL decorators (`@cute.kernel`, `@cute.jit`) and CuTe tensor abstractions. Refer to the 'Limitations' and 'FAQs' sections in the official documentation.","message":"NVIDIA CUTLASS Python DSL (CuTe DSL) is a distinct project from the older 'CUTLASS Python' (which was a Python interface for C++ kernels). Existing code relying on the older interface will not be compatible.","severity":"breaking","affected_versions":"4.0.0 and later"},{"fix":"Ensure your installed CUDA Toolkit version is compatible with the `nvidia-cutlass-dsl` version. For CUDA Toolkit 13.1+, use `pip install nvidia-cutlass-dsl[cu13]`. Always check release notes for specific version requirements.","message":"The DSL requires a specific NVIDIA CUDA Toolkit version. For example, version 4.4.2 supports Python 3.10-3.14 and requires CUDA Toolkit 12.0+ (with 13.1 recommended for latest features like GB300 and Hopper FMHA fixes). Incompatible toolkit versions can lead to performance regressions, compilation errors, or runtime issues.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Understand and adhere to the DSL's programming model. Use primitive types (int, bool, float) as dynamic values. For complex data, use them for 'meta-programming' or configuration during compilation, not as modifiable runtime data within the kernel. 
Refer to the 'Limitations' documentation.","message":"CuTe DSL has design limitations regarding Python language semantics within JIT-compiled functions. Complex data structures like lists, tuples, or dictionaries passed as dynamic values are treated as static containers and cannot be modified at runtime inside kernels. Returning dynamic values from kernels is also currently limited.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install the required `tvm-ffi` packages and ensure TVM FFI is correctly enabled in your code or environment if you intend to use it.","message":"Optional features like Apache TVM FFI, which improves PyTorch interoperability and reduces host overhead, require separate installation (`pip install apache-tvm-ffi torch-c-dlpack-ext`) and explicit enabling (e.g., via `enable_tvm_ffi=True` in `cute.runtime.from_dlpack` or by setting `CUTE_DSL_ENABLE_TVM_FFI=1` environment variable).","severity":"gotcha","affected_versions":"4.3.0 and later"},{"fix":"Update calls to affected `cute.arch` functions to pass string literals (e.g., `'Release'`, `'Acquire'`) instead of previous enum-like objects. Consult the `changelog` and documentation for specific function signatures.","message":"API changes in `cutlass.cute.arch` functions (e.g., `fence_proxy`, `warp_redux_sync`, `atomic_add`, `load`, `store`) in CUDA Toolkit 13.1+ environments now require string literals instead of enum arguments.","severity":"breaking","affected_versions":"4.4.0 and later (when used with CTK 13.1+)"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}