{"id":3185,"library":"nvidia-cutlass-dsl-libs-base","title":"NVIDIA CUTLASS Python DSL Base Libraries","description":"NVIDIA CUTLASS Python DSL (`nvidia-cutlass-dsl-libs-base`) provides a Pythonic interface for writing high-performance CUDA kernels using CUTLASS's CuTe library and tensor abstractions. It enables kernel development with automatic compilation to optimized PTX/SASS, offering performance comparable to hand-written CUDA C++ while enhancing developer productivity. Currently at version 4.4.2, the library is actively developed with frequent releases, often tied to new CUDA Toolkit versions and NVIDIA GPU architectures.","status":"active","version":"4.4.2","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/cutlass","tags":["cuda","gpu","high-performance","deep-learning","linear-algebra","nvidia","kernel","dsl","jit-compilation"],"install":[{"cmd":"pip install nvidia-cutlass-dsl","lang":"bash","label":"Recommended (installs `libs-base` as dependency)"},{"cmd":"pip install nvidia-cutlass-dsl-libs-base","lang":"bash","label":"Direct install (minimal)"},{"cmd":"pip install nvidia-cutlass-dsl[cu13]","lang":"bash","label":"For CUDA Toolkit 13.1 support"}],"dependencies":[{"reason":"Required for CUDA integration and kernel launch.","package":"cuda-python","optional":false},{"reason":"Recommended for integration with PyTorch frameworks and running examples.","package":"torch","optional":true},{"reason":"Recommended for integration with JAX frameworks and running examples.","package":"jax","optional":true},{"reason":"Common dependency for tensor operations, often used in examples.","package":"numpy","optional":false},{"reason":"Underlying dependency for some DSL functionalities.","package":"networkx","optional":false},{"reason":"Underlying dependency for some DSL functionalities.","package":"pydot","optional":false},{"reason":"Underlying dependency for some DSL functionalities.","package":"scipy","optional":false},{"reason":"Underlying 
dependency for some DSL functionalities.","package":"treelib","optional":false}],
"imports":[{"note":"Primary import for the CuTe DSL functionalities.","symbol":"cute","correct":"import cutlass.cute as cute"},{"note":"Used for seamless integration with DLPack-compatible frameworks like PyTorch.","symbol":"from_dlpack","correct":"from cutlass.cute.runtime import from_dlpack"},{"note":"The legacy Python API package `cutlass` was renamed to `cutlass_cppgen` in CUTLASS 4.2.0 to disambiguate it from the CuTe DSL.","wrong":"import cutlass","symbol":"cutlass","correct":"import cutlass_cppgen as cutlass"}],
"quickstart":{"code":"import torch\n\nimport cutlass.cute as cute\nfrom cutlass.cute.runtime import from_dlpack\n\n\n@cute.kernel\ndef elementwise_add_kernel(\n    gA: cute.Tensor,\n    gB: cute.Tensor,\n    gC: cute.Tensor,\n):\n    # Compute this thread's global linear index from block/thread coordinates\n    tidx, _, _ = cute.arch.thread_idx()\n    bidx, _, _ = cute.arch.block_idx()\n    bdim, _, _ = cute.arch.block_dim()\n    thread_idx = bidx * bdim + tidx\n\n    # Map the linear index to a 2D coordinate\n    m, n = gA.shape\n    mi = thread_idx // n\n    ni = thread_idx % n\n\n    # Element-wise addition with a basic bounds check\n    if mi < m:\n        gC[mi, ni] = gA[mi, ni] + gB[mi, ni]\n\n\n@cute.jit\ndef elementwise_add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):\n    m, n = mA.shape\n    threads_per_block = 256  # Example thread block size\n    # Instantiate the kernel and launch it from the host-side JIT function,\n    # with enough blocks to cover all M * N elements\n    elementwise_add_kernel(mA, mB, mC).launch(\n        grid=[(m * n + threads_per_block - 1) // threads_per_block, 1, 1],\n        block=[threads_per_block, 1, 1],\n    )\n\n\nM, N = 1024, 512\nA_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')\nB_torch = torch.randn(M, N, dtype=torch.float32, device='cuda')\nC_torch = torch.zeros(M, N, dtype=torch.float32, device='cuda')\n\n# Convert torch tensors to CuTe tensors via DLPack\nmA = from_dlpack(A_torch).mark_layout_dynamic()\nmB = from_dlpack(B_torch).mark_layout_dynamic()\nmC = from_dlpack(C_torch).mark_layout_dynamic()\n\n# Compile the host function once, then invoke the compiled version\ncompiled_add = cute.compile(elementwise_add, mA, mB, mC)\ncompiled_add(mA, mB, mC)\n\n# Verify against the PyTorch reference\ntorch.testing.assert_close(C_torch, A_torch + B_torch)\nprint(\"Kernel executed successfully and results match!\")\n","lang":"python","description":"This quickstart defines a simple element-wise addition kernel with the CuTe DSL. It demonstrates the `@cute.kernel` decorator, `cute.Tensor` arguments, global thread indexing via `cute.arch.thread_idx()`, `cute.arch.block_idx()`, and `cute.arch.block_dim()`, a host-side `@cute.jit` function that launches the kernel, conversion of PyTorch tensors with `from_dlpack`, ahead-of-time compilation with `cute.compile`, and verification against a PyTorch reference."},
"warnings":[{"fix":"Update `import cutlass` to `import cutlass_cppgen as cutlass` for the high-level C++ interface. The `cutlass.cute` import for the CuTe DSL remains unchanged.","message":"The legacy Python API package, previously named `cutlass` (e.g., `import cutlass`), was renamed to `cutlass_cppgen` in CUTLASS 4.2.0 (around September 2025). Direct imports of `cutlass` for the high-level C++ wrappers will fail.","severity":"breaking","affected_versions":"4.2.0 and later"},{"fix":"Always check the official CUTLASS documentation's 'Quick Start Guide' or 'Installation' section for the exact CUDA Toolkit and driver version required for your `nvidia-cutlass-dsl` version. For CUDA Toolkit 13.1, specific installation flags like `pip install nvidia-cutlass-dsl[cu13]` might be necessary.","message":"CUTLASS Python DSL (including `nvidia-cutlass-dsl-libs-base`) has strict compatibility requirements with specific CUDA Toolkit and NVIDIA driver versions. 
Mismatches can lead to runtime errors or compilation failures.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Users experiencing performance regressions should upgrade to version 4.3.5 or later (e.g., 4.4.x), where the issue was fixed.","message":"Unexpected CPU overhead was introduced in version 4.3.4 of the CuTe DSL.","severity":"gotcha","affected_versions":"4.3.4"},{"fix":"For `nvidia-cutlass-dsl` 4.4.2, Python 3.10 through 3.14 are supported. Always verify your Python version against the latest documentation for your specific CUTLASS DSL release.","message":"Initial releases of CUTLASS DSL 4.0 had limited Python version support (e.g., Python 3.12 only). While newer versions expand this, ensure your Python version is explicitly supported.","severity":"gotcha","affected_versions":"4.0.0 - 4.4.1"},{"fix":"Users on aarch64 utilizing `tvm-ffi` should ensure they are running `nvidia-cutlass-dsl` version 4.4.1 or newer to avoid stability issues.","message":"Versions prior to 4.4.1 could segfault when using `tvm-ffi` on aarch64 systems; version 4.4.1 fixed this.","severity":"gotcha","affected_versions":"Pre-4.4.1 (especially for aarch64 with `tvm-ffi`)"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}