{"id":6643,"library":"flash-linear-attention","title":"Flash Linear Attention","description":"Flash Linear Attention (FLA) is a Python library providing efficient, Triton-based implementations for state-of-the-art linear attention models and emerging sequence modeling architectures. It aims for high-performance training and inference across NVIDIA, AMD, and Intel GPUs. As of version 0.4.2, the library is actively maintained with frequent releases, offering optimized kernels, fused modules, and integration-ready layers for PyTorch and Hugging Face models.","status":"active","version":"0.4.2","language":"en","source_language":"en","source_url":"https://github.com/fla-org/flash-linear-attention","tags":["attention","linear-attention","deep-learning","pytorch","triton","transformers","gpu-acceleration","sequence-modeling"],"install":[{"cmd":"pip install flash-linear-attention","lang":"bash","label":"Standard Installation"},{"cmd":"pip install torch triton einops transformers numpy\n# For AMD GPUs, ensure Triton ROCm backend is installed separately.\n# For Intel GPUs, ensure Triton XPU backend is installed separately.","lang":"bash","label":"With Core Dependencies (Advanced)"}],"dependencies":[{"reason":"Core deep learning framework dependency for models and operations.","package":"torch","optional":false},{"reason":"Required for high-performance, custom GPU kernels. 
Version compatibility with PyTorch is crucial.","package":"triton","optional":false},{"reason":"Used for flexible tensor operations.","package":"einops","optional":false},{"reason":"Provides integration-ready layers and models compatible with the Hugging Face ecosystem.","package":"transformers","optional":false},{"reason":"Common scientific computing library, often an implicit dependency for ML libraries.","package":"numpy","optional":false}],"imports":[{"note":"A common import pattern for using attention layers.","symbol":"MultiScaleRetention","correct":"from fla.layers import MultiScaleRetention"},{"note":"Example import of a model configuration (Gated Linear Attention), usable with Hugging Face AutoModel classes.","symbol":"GLAConfig","correct":"from fla.models import GLAConfig"}],"quickstart":{"code":"import torch\nfrom fla.layers import MultiScaleRetention\n\n# Example input tensor (batch_size, sequence_length, hidden_dim)\nbatch_size = 2\nsequence_length = 512\nhidden_dim = 128\n\n# Ensure CUDA is available and tensors are on GPU for optimal performance\nif torch.cuda.is_available():\n    # Triton kernels are tuned for half precision; bfloat16 matches the official examples\n    input_tensor = torch.randn(batch_size, sequence_length, hidden_dim, device='cuda', dtype=torch.bfloat16)\n    # Initialize the MultiScaleRetention layer\n    # hidden_size must match the input's last dimension; num_heads sets the number of retention heads\n    model = MultiScaleRetention(hidden_size=hidden_dim, num_heads=4).to(device='cuda', dtype=torch.bfloat16)\n\n    # Forward pass: the layer returns a tuple whose first element is the output tensor\n    output_tensor, *_ = model(input_tensor)\n\n    print(f\"Input shape: {input_tensor.shape}\")\n    print(f\"Output shape: {output_tensor.shape}\")\nelse:\n    print(\"CUDA is not available. Please ensure a compatible GPU and PyTorch installation.\")\n    print(\"Tensors and the model must be on the GPU for Flash Linear Attention.\")","lang":"python","description":"This quickstart demonstrates how to initialize and use a `MultiScaleRetention` layer from `flash-linear-attention` with a sample PyTorch tensor. 
It's crucial to run this on a CUDA-enabled GPU to get the performance benefits of the Triton kernels."},"warnings":[{"fix":"Ensure `flash-linear-attention` is installed for the full feature set. If encountering issues, reinstall both `fla-core` and `flash-linear-attention` (or uninstall previous versions first), and verify the import paths of specific modules.","message":"Starting from v0.3.2, the `flash-linear-attention` package was split into `fla-core` (minimal dependencies) and `flash-linear-attention` (an extension that includes `fla/layers` and `fla/models` and depends on `transformers`). Users upgrading from older versions or relying on direct `fla.ops` imports may experience changes in dependency management or module resolution.","severity":"breaking","affected_versions":">=0.3.2"},{"fix":"Adjust input tensor shapes to the new 'sequence-first' format, e.g. `(batch_size, sequence_length, num_heads, head_dim)` instead of the head-first `(batch_size, num_heads, sequence_length, head_dim)`.","message":"In November 2024, the input tensor format was switched from 'head-first' to 'sequence-first', changing the expected dimension order of input tensors passed to the attention kernels.","severity":"breaking","affected_versions":">=0.4.0 (approximately)"},{"fix":"Always install PyTorch and Triton from official sources, ensuring their versions are compatible. The official `flash-linear-attention` FAQs often provide guidance. Consider using a fresh `conda` environment to isolate installations.","message":"Strict compatibility between PyTorch and Triton versions is required. Using incompatible versions can lead to `AttributeError` (e.g., `'NoneType' object has no attribute 'start'`) or `LinearLayout Assertion Error`. 
This is especially relevant for nightly builds or specific hardware (like ARM).","severity":"gotcha","affected_versions":"All versions"},{"fix":"Consult the Triton documentation or `flash-linear-attention` FAQs for instructions on installing the appropriate GPU-specific Triton backend for your hardware.","message":"For AMD and Intel GPUs, specific Triton ROCm or XPU backends are required, which might need separate installation steps beyond `pip install triton`. Without the correct backend, performance will be severely impacted or the library may not function.","severity":"gotcha","affected_versions":"All versions on AMD/Intel GPUs"},{"fix":"Ensure your environment uses Python 3.10 or a later version. Upgrade Python if necessary.","message":"The library explicitly requires Python 3.10 or newer. Older Python versions can lead to `AttributeError: 'NoneType' object has no attribute 'start'` during Triton kernel compilation.","severity":"gotcha","affected_versions":"<=0.4.2"},{"fix":"You can safely remove `causal-conv1d` from your dependencies if it was only used for `flash-linear-attention`.","message":"The external `causal-conv1d` library is no longer a required dependency as `flash-linear-attention` now provides its own Triton implementations for `conv1d` operations.","severity":"deprecated","affected_versions":">=0.4.0 (approximately)"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}