{"id":27743,"library":"flash-attn-4","title":"Flash Attention 4 (CUTE implementation)","description":"Flash Attention 4 is the next-generation implementation of the Flash Attention algorithm using NVIDIA CUTE (CUDA Template Engine). It provides highly optimized fused attention kernels for modern GPUs, supporting head dimensions up to 256 and various data types including FP8. Version 4.0.0b12 is in beta, with frequent releases.","status":"active","version":"4.0.0b12","language":"python","source_language":"en","source_url":"https://github.com/Dao-AILab/flash-attention","tags":["flash attention","CUTE","CUDA","attention","transformer"],"install":[{"cmd":"pip install flash-attn-4","lang":"bash","label":"PyPI install"}],"dependencies":[{"reason":"Required for tensor operations and GPU support.","package":"torch","optional":false}],"imports":[{"note":"Flash Attention 4 uses a separate PyPI package 'flash-attn-4' with module name 'flash_attn_4'. Importing from old 'flash_attn' imports Flash Attention 2/3.","wrong":"from flash_attn import flash_attn_func","symbol":"flash_attn_func","correct":"from flash_attn_4 import flash_attn_func"},{"note":"Correct import for variable-length sequences.","wrong":null,"symbol":"flash_attn_varlen_func","correct":"from flash_attn_4 import flash_attn_varlen_func"}],"quickstart":{"code":"import torch\nfrom flash_attn_4 import flash_attn_func\n\nq = torch.randn(1, 4, 128, 64, device='cuda', dtype=torch.float16)\nk = torch.randn(1, 4, 128, 64, device='cuda', dtype=torch.float16)\nv = torch.randn(1, 4, 128, 64, device='cuda', dtype=torch.float16)\nout, lse = flash_attn_func(q, k, v, causal=True)\nprint(out.shape)","lang":"python","description":"Basic forward pass with causal masking."},"warnings":[{"fix":"Update code to unpack the tuple: out, lse = flash_attn_func(...)","message":"Flash Attention 4 is a completely new implementation using CUTE. The API has changed; functions like flash_attn_func now return a tuple (out, lse) instead of just out.","severity":"breaking","affected_versions":">=4.0.0b1"},{"fix":"Use 'pip install flash-attn-4' and 'import flash_attn_4'.","message":"The PyPI package name is 'flash-attn-4', and the Python module is 'flash_attn_4'. Do not confuse with the old 'flash-attn' package (Flash Attention 2/3).","severity":"gotcha","affected_versions":"all"},{"fix":"Check GPU compute capability via torch.cuda.get_device_capability(). Minimum 8.0 required.","message":"Flash Attention 4 only supports CUDA GPUs with compute capability 8.0+ (Ampere, Hopper, Blackwell). It will fail on older GPUs.","severity":"gotcha","affected_versions":"all"}],"env_vars":null,"last_verified":"2026-05-09T00:00:00.000Z","next_check":"2026-08-07T00:00:00.000Z","problems":[{"fix":"Run: pip install flash-attn-4","cause":"Installed wrong package: installed 'flash-attn' (FA2/3) instead of 'flash-attn-4'.","error":"ModuleNotFoundError: No module named 'flash_attn_4'"},{"fix":"Ensure q, k, v are all CUDA tensors: q = q.cuda() etc.","cause":"Tensors not moved to GPU before calling flash_attn_func.","error":"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"}],"ecosystem":"pypi","meta_description":null,"install_score":null,"install_tag":null,"quickstart_score":null,"quickstart_tag":null}