{"id":4359,"library":"flash-attn","title":"Flash Attention","description":"Flash Attention is a fast, memory-efficient, exact attention mechanism for deep learning models, particularly Transformers. It tiles and reorders the attention computation to minimize reads and writes to GPU memory, making it significantly faster and less memory-intensive than standard attention. The library is currently stable at version 2.8.3, with version 4.0.0 in active beta, introducing new features and architectural changes. Its release cadence is driven by research advances and performance optimizations.","status":"active","version":"2.8.3","language":"en","source_language":"en","source_url":"https://github.com/Dao-AILab/flash-attention","tags":["attention","transformer","cuda","gpu","deep-learning","pytorch","optimization","ai"],"install":[{"cmd":"pip install flash-attn --no-build-isolation","lang":"bash","label":"Recommended for CUDA support"},{"cmd":"MAX_JOBS=4 pip install flash-attn --no-build-isolation","lang":"bash","label":"Limit parallel compile jobs on machines with limited RAM"}],"dependencies":[{"reason":"Core deep learning framework","package":"torch","optional":false},{"reason":"A CUDA toolkit matching your PyTorch build is required to compile and run the kernels","package":"cuda","optional":false},{"reason":"Speeds up C++/CUDA extension compilation considerably; install it before building","package":"ninja","optional":true}],"imports":[{"symbol":"flash_attn_func","correct":"from flash_attn import flash_attn_func"},{"symbol":"flash_attn_qkvpacked_func","correct":"from flash_attn import flash_attn_qkvpacked_func"},{"symbol":"flash_attn_varlen_func","correct":"from flash_attn import flash_attn_varlen_func"},{"symbol":"FlashSelfAttention","correct":"from flash_attn.modules.mha import FlashSelfAttention"}],"quickstart":{"code":"import torch\nfrom flash_attn import flash_attn_func\n\n# Example for q, k, v as separate tensors\nbatch_size = 2\nseq_len = 128\nnum_heads = 8\nhead_dim = 64  # Must be a multiple of 8 and at most 256\n\ndtype = 
torch.float16  # FlashAttention requires float16 or bfloat16 inputs\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\n\nif device == 'cpu':\n    raise RuntimeError(\"FlashAttention requires a CUDA GPU; there is no CPU fallback.\")\n\nq = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=dtype, device=device)\nk = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=dtype, device=device)\nv = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=dtype, device=device)\n\n# Causal attention (for language models)\noutput = flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=True)\n\nprint(\"Output shape:\", output.shape)\nprint(\"Output device:\", output.device)","lang":"python","description":"This quickstart demonstrates how to use `flash_attn_func` with separate query, key, and value tensors. FlashAttention only accepts float16/bfloat16 inputs on a CUDA device, so the example raises an error when no GPU is available rather than failing inside the kernel. The `causal=True` argument applies the causal mask used by generative language models. Ensure your `head_dim` is a multiple of 8 and no more than 256."},"warnings":[{"fix":"Refer to the official documentation or GitHub README for the exact function signature matching your installed version. Pay close attention to `qkv` vs `q, k, v` inputs and boolean flags.","message":"The API for `flash_attn_func` has changed significantly between v1, v2, and the v4 beta, including argument order, default values, and added parameters (e.g., `softmax_scale`, `dropout_p`, `causal`, different return values). Code written for v1 or early v2 will likely break on later v2 or v4.","severity":"breaking","affected_versions":"<2.0, 2.x, 4.x (beta)"},{"fix":"Ensure your GPU supports the required CUDA architecture. Adjust `head_dim` to be a multiple of 8. For best performance, keep `head_dim <= 256`. 
Check the FlashAttention GitHub for exact hardware requirements.","message":"Flash Attention requires a specific CUDA architecture (SM70+ for v1/v2, SM80+ for v2.2+, SM90+ for v4 beta) and specific `head_dim` values. Typically, `head_dim` must be a multiple of 8 (e.g., 64, 128, 256) and for optimal performance, should not exceed 256. Using unsupported `head_dim` or CUDA architecture will result in runtime errors or fallbacks to slower implementations.","severity":"gotcha","affected_versions":"All"},{"fix":"Always use `pip install flash-attn --no-build-isolation` when installing with CUDA extensions. Ensure your PyTorch version and CUDA toolkit are compatible as specified by PyTorch. Consider installing `ninja` first (`pip install ninja`) for smoother compilation.","message":"Installation can be sensitive to your PyTorch and CUDA setup. Using `pip install flash-attn` without `--no-build-isolation` can lead to `flash-attn` compiling against a different CUDA toolkit than your PyTorch installation, causing runtime errors or crashes.","severity":"gotcha","affected_versions":"All"},{"fix":"If migrating to or from v4 beta, review the specific release notes and documentation for v4. Adapt your code to the new API, especially for module initialization and function calls.","message":"The FlashAttention v4 beta introduces new APIs and internal architecture changes. Code written for v2.x is NOT directly compatible with the v4 beta, and vice versa. Key changes include a redesigned `FlashAttention2` module and `flash_attn_func` with updated arguments to support new features like dynamic sequence lengths.","severity":"breaking","affected_versions":"4.0.0.betaX"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}