{"library":"nvidia-cutlass-dsl","title":"NVIDIA CUTLASS Python DSL","description":"NVIDIA CUTLASS Python DSL (version 4.4.2) is a Python-based domain-specific language (DSL) for writing high-performance CUDA kernels. It provides a Pythonic interface to CUTLASS's CuTe library, enabling kernel development with automatic JIT compilation to optimized PTX/SASS for NVIDIA GPUs (Ampere, Hopper, Blackwell architectures). It aims for zero-cost abstraction, performance comparable to C++ kernels, and seamless integration with deep learning frameworks like PyTorch and JAX. The library maintains an active development pace with frequent updates and minor version releases.","language":"python","status":"active","last_verified":"Thu May 14","install":{"commands":["pip install nvidia-cutlass-dsl","pip install nvidia-cutlass-dsl[cu13]"],"cli":{"name":"cutlass","version":"sh: 1: cutlass: not found"}},"imports":["import cutlass.cute as cute","from cutlass.cute import kernel","from cutlass.cute import jit","from cutlass.cute.runtime import from_dlpack"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import cutlass.cute as cute\nfrom cutlass.cute.runtime import from_dlpack\nimport torch\n\n@cute.kernel\ndef elementwise_add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):\n    # Get thread index (tidx), block index (bidx), and block size (bdim)\n    tidx, _, _ = cute.arch.thread_idx()\n    bidx, _, _ = cute.arch.block_idx()\n    bdim, _, _ = cute.arch.block_dim()\n\n    # Calculate global index (simple 1D mapping for demonstration)\n    # In a real kernel, this would involve more complex layout algebra\n    global_idx = bidx * bdim + tidx\n\n    # Perform element-wise addition, skipping threads past the end of the tensor\n    if global_idx < cute.size(gC):\n        gC[global_idx] = gA[global_idx] + gB[global_idx]\n\n@cute.jit\ndef launch_add_kernel(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor):\n    # Compute the launch configuration from the tensor size\n    num_elements = cute.size(A)\n    threads_per_block = 256 # Example thread block size\n    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block\n\n    elementwise_add_kernel(\n        
A,\n        B,\n        C\n    ).launch(\n        grid=[blocks_per_grid, 1, 1],\n        block=[threads_per_block, 1, 1]\n    )\n\nif __name__ == '__main__':\n    # Create example PyTorch tensors on GPU\n    size = 1024 * 1024 # 1,048,576 elements\n    A_torch = torch.randn(size, dtype=torch.float32, device='cuda')\n    B_torch = torch.randn(size, dtype=torch.float32, device='cuda')\n    C_torch = torch.empty_like(A_torch, device='cuda')\n\n    # Wrap the PyTorch tensors as CuTe tensors via DLPack on the host,\n    # then launch the CuTe DSL kernel\n    launch_add_kernel(\n        from_dlpack(A_torch),\n        from_dlpack(B_torch),\n        from_dlpack(C_torch)\n    )\n\n    # Verify results (optional, using torch for comparison)\n    C_expected = A_torch + B_torch\n    assert torch.allclose(C_torch, C_expected, atol=1e-5), \"Results do not match!\"\n    print(\"Kernel executed successfully and results verified.\")\n","lang":"python","description":"This quickstart demonstrates a simple element-wise addition kernel written using CuTe DSL. It defines a GPU kernel with `@cute.kernel` and a host-side launch function with `@cute.jit`. It also shows how to interoperate with PyTorch tensors using `cute.runtime.from_dlpack` to pass data to the JIT-compiled kernel. 
The example performs vector addition on CUDA, launches the kernel, and verifies the output against PyTorch's native operation.","tag":null,"tag_description":null,"last_tested":"2026-04-25","results":[{"runtime":"python:3.10-alpine","exit_code":1},{"runtime":"python:3.10-slim","exit_code":1},{"runtime":"python:3.11-alpine","exit_code":1},{"runtime":"python:3.11-slim","exit_code":1},{"runtime":"python:3.12-alpine","exit_code":1},{"runtime":"python:3.12-slim","exit_code":1},{"runtime":"python:3.13-alpine","exit_code":1},{"runtime":"python:3.13-slim","exit_code":1},{"runtime":"python:3.9-alpine","exit_code":1},{"runtime":"python:3.9-slim","exit_code":1}]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-14","installed_version":"4.5.0","pypi_latest":"4.5.0","is_stale":false,"summary":{"python_range":"3.10–3.13","success_rate":40,"avg_install_s":7,"avg_import_s":1.82,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine 
(musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6.3,"import_time_s":0.97,"mem_mb":31.4,"disk_size":"291M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.87,"mem_mb":30.4,"disk_size":"290M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.2,"import_time_s":0.84,"mem_mb":32.4,"disk_size":"302M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.81,"mem_mb":31.3,"disk_size":"300M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.9,"import_time_s":2.16,"mem_mb":35.8,"disk_size":"301M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.15,"mem_mb":34.7,"disk_size":"299M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.4,"import_time_s":2.64,"mem_mb":37.1,"disk_size":"312M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.51,"mem_mb":35.8,"disk_size":"309M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine 
(musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.7,"import_time_s":2.11,"mem_mb":36.4,"disk_size":"288M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.54,"mem_mb":35.2,"disk_size":"287M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":7.9,"import_time_s":2.19,"mem_mb":37.7,"disk_size":"300M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.21,"mem_mb":36.4,"disk_size":"297M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine 
(musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.7,"import_time_s":1.64,"mem_mb":34.2,"disk_size":"288M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":1.89,"mem_mb":33.1,"disk_size":"287M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.1,"import_time_s":1.78,"mem_mb":35.6,"disk_size":"299M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim 
(glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":1.87,"mem_mb":34.3,"disk_size":"297M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":2,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null}]}}