{"library":"nvidia-cutlass-dsl-libs-base","title":"NVIDIA CUTLASS Python DSL Base Libraries","description":"NVIDIA CUTLASS Python DSL (`nvidia-cutlass-dsl-libs-base`) provides a Pythonic interface for writing high-performance CUDA kernels using CUTLASS's CuTe library and tensor abstractions. It enables kernel development with automatic compilation to optimized PTX/SASS, offering performance comparable to hand-written CUDA C++ while enhancing developer productivity. Currently at version 4.4.2, the library is actively developed with frequent releases, often tied to new CUDA Toolkit versions and NVIDIA GPU architectures.","language":"python","status":"active","last_verified":"Thu May 14","install":{"commands":["pip install nvidia-cutlass-dsl","pip install nvidia-cutlass-dsl-libs-base","pip install nvidia-cutlass-dsl[cu13]"],"cli":null},"imports":["import cutlass.cute as cute","from cutlass.cute.runtime import from_dlpack","import cutlass_cppgen as cutlass"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import cutlass.cute as cute\nimport torch\nfrom cutlass.cute.runtime import from_dlpack\n\n@cute.kernel\ndef elementwise_add_kernel(\n    gA: cute.Tensor, \n    gB: cute.Tensor, \n    gC: cute.Tensor,\n    shape: cute.Shape\n):\n    # Get thread index within the block\n    tidx, _, _ = cute.arch.thread_idx()\n\n    # Map thread index to global memory coordinate\n    # Example: Simple 1D mapping for illustration\n    # In real kernels, you'd use more sophisticated layouts and transforms\n    val_layout = cute.make_layout(shape)\n    coords = val_layout(tidx)\n\n    # Perform element-wise addition\n    if tidx < shape[0] * shape[1]: # Basic bounds check\n        gC[coords] = gA[coords] + gB[coords]\n\nM, N = 1024, 512\nA_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')\nB_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')\nC_torch = torch.zeros(M, N, dtype=torch.float32, device='cuda')\n\n# Convert torch tensors to CuTe Tensors\nmA = from_dlpack(A_torch).mark_layout_dynamic()\nmB = from_dlpack(B_torch).mark_layout_dynamic()\nmC = from_dlpack(C_torch).mark_layout_dynamic()\n\n# Compile the kernel\ncompiled_kernel = cute.compile(elementwise_add_kernel, mA, mB, mC, (M, N))\n\n# Launch the kernel\n# A simple block/grid configuration. More complex kernels would use CuTe's layout algebra.\nblock_size = 256 # Example thread block size\ngrid_size = (M * N + block_size - 1) // block_size # Ensure enough blocks\n\ncompiled_kernel.launch(grid=[grid_size, 1, 1], block=[block_size, 1, 1])\n\n# Verify (optional, requires torch.testing)\ntry:\n    torch.testing.assert_close(C_torch, A_torch + B_torch)\n    print(\"Kernel executed successfully and results match!\")\nexcept AssertionError as e:\n    print(f\"Verification failed: {e}\")\n","lang":"python","description":"This quickstart demonstrates how to define a simple element-wise addition CUDA kernel using the CuTe DSL. It shows the use of the `@cute.kernel` decorator, `cute.Tensor` for arguments, accessing thread indices with `cute.arch.thread_idx()`, converting PyTorch tensors using `from_dlpack`, compiling the kernel with `cute.compile`, and launching it on the GPU.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-14","installed_version":"4.5.0","pypi_latest":"4.5.0","is_stale":false,"summary":{"python_range":"3.10–3.9","success_rate":40,"avg_install_s":6.9,"avg_import_s":1.81,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6.4,"import_time_s":0.89,"mem_mb":31.4,"disk_size":"291M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.87,"mem_mb":30.4,"disk_size":"290M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6.2,"import_time_s":0.86,"mem_mb":31.4,"disk_size":"291M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.81,"mem_mb":30.4,"disk_size":"290M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":9.1,"import_time_s":0.85,"mem_mb":32.5,"disk_size":"302M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.87,"mem_mb":31.3,"disk_size":"300M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6.4,"import_time_s":2.36,"mem_mb":35.8,"disk_size":"301M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.18,"mem_mb":34.7,"disk_size":"299M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6,"import_time_s":2.4,"mem_mb":35.8,"disk_size":"301M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.06,"mem_mb":34.7,"disk_size":"299M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.5,"import_time_s":2.39,"mem_mb":37.1,"disk_size":"312M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.24,"mem_mb":35.8,"disk_size":"309M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.8,"import_time_s":2.35,"mem_mb":36.4,"disk_size":"288M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.48,"mem_mb":35.2,"disk_size":"287M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.8,"import_time_s":2.08,"mem_mb":36.4,"disk_size":"288M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.25,"mem_mb":35.2,"disk_size":"287M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.2,"import_time_s":2.28,"mem_mb":37.7,"disk_size":"300M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":2.69,"mem_mb":36.4,"disk_size":"297M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":6.3,"import_time_s":1.6,"mem_mb":34.2,"disk_size":"288M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":1.8,"mem_mb":33.1,"disk_size":"287M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":5.9,"import_time_s":1.64,"mem_mb":34.2,"disk_size":"288M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":1.81,"mem_mb":33.1,"disk_size":"286M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":8.2,"import_time_s":1.74,"mem_mb":35.6,"disk_size":"299M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"cu13","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":1.9,"mem_mb":34.3,"disk_size":"297M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"nvidia-cutlass-dsl-libs-base","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"cu13","exit_code":1,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null}]}}