xFormers
xFormers is a PyTorch-based library of composable, optimized building blocks for Transformer models. It aims to accelerate deep learning research with flexible, highly efficient components, including advanced attention mechanisms and fused operations that often outperform native PyTorch implementations in both speed and memory usage. The library is actively developed by Meta Platforms, Inc. and releases updates frequently, with 0.0.35 as the latest stable release at the time of writing.
Warnings
- breaking xFormers wheels are built against specific PyTorch and CUDA versions. Installing 'xformers' via pip without specifying a matching PyTorch index URL can lead to incompatibility issues or an unwanted PyTorch upgrade.
- gotcha Many xFormers optimizations, particularly `memory_efficient_attention`, can produce non-deterministic results, meaning repeated runs with the same inputs might yield slightly different outputs.
- breaking Dropped support for V100 and older NVIDIA GPUs, following PyTorch's deprecation schedule. Building Flash-Attention 2 as part of xFormers is also deprecated.
- deprecated Many classes and modules within `xformers.factory`, `xformers.triton`, and `xformers.components` have been or will be deprecated.
- breaking The `memory_efficient_attention` function now expects the `attn_bias` argument to explicitly have a head dimension. It no longer automatically broadcasts batch/head dimensions for `attn_bias`.
- gotcha Building xFormers from source on Windows can be complex due to dependencies on Visual Studio Build Tools, specific CUDA Toolkit versions, and potential long path issues. Pre-built wheels are highly recommended.
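The `attn_bias` head-dimension requirement above can be illustrated with plain tensor shapes (no xFormers needed). The sketch below shows how a bias that previously lacked a head dimension can be expanded explicitly; the exact shape and dtype requirements depend on your xFormers version, so treat this as illustrative:

```python
import torch

# Illustrative shapes: a tensor attn_bias is expected to carry an explicit
# head dimension, e.g. (batch, num_heads, seq_q, seq_kv).
batch, heads, seq_q, seq_kv = 2, 8, 128, 128

# Old-style bias without a head dimension (no longer broadcast automatically):
bias_no_head = torch.randn(batch, seq_q, seq_kv)

# Add and expand the head dimension explicitly before passing it in.
# Note: expand() creates a zero-stride view; some kernels may require a
# materialized tensor, in which case call .contiguous() as well.
bias = bias_no_head.unsqueeze(1).expand(batch, heads, seq_q, seq_kv)
print(tuple(bias.shape))  # (2, 8, 128, 128)
```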
Install
- pip install xformers
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu126
- pip install --pre -U xformers
- pip install ninja
- pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
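To pick the right `--index-url`, it helps to check which CUDA version the installed PyTorch build targets (this prints `None` for CPU-only builds):

```shell
# Print the installed torch version and its CUDA build (e.g. "2.5.1 12.6");
# choose the matching wheel index, e.g. cu126 for CUDA 12.6.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```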
Imports
- memory_efficient_attention
from xformers.ops import memory_efficient_attention
- LowerTriangularMask
from xformers.ops.fmha.attn_bias import LowerTriangularMask
- AttentionOpBase
from xformers.ops import AttentionOpBase
Quickstart
import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask
# Ensure tensors are on CUDA if available
device = "cuda" if torch.cuda.is_available() else "cpu"
# Assume batch_size=2, seq_len=128, num_heads=8, head_dim=64
batch_size = 2
seq_len = 128
num_heads = 8
head_dim = 64
# Create dummy query, key, value tensors
# xFormers memory_efficient_attention typically expects (batch_size, seq_len, num_heads, head_dim)
query = torch.randn(batch_size, seq_len, num_heads, head_dim, device=device)
key = torch.randn(batch_size, seq_len, num_heads, head_dim, device=device)
value = torch.randn(batch_size, seq_len, num_heads, head_dim, device=device)
# float16 is common for performance with xFormers, and its CUDA kernels
# generally expect fp16/bf16; half precision is poorly supported on CPU,
# so only cast when running on CUDA
if device == "cuda":
    query, key, value = query.half(), key.half(), value.half()
# Example 1: Standard memory-efficient attention
# xFormers automatically dispatches to the best available operator
output_attn = memory_efficient_attention(query, key, value)
print(f"Output attention shape (standard): {output_attn.shape}")
# Example 2: Causal attention with a lower triangular mask
# Note: since ~v0.0.21, a tensor attn_bias must carry an explicit head dimension;
# the predefined LowerTriangularMask handles this internally, so no manual
# shape changes are needed here.
attn_bias = LowerTriangularMask()
output_causal_attn = memory_efficient_attention(query, key, value, attn_bias=attn_bias)
print(f"Output attention shape (causal): {output_causal_attn.shape}")
# To verify the installation and list available kernels, run from a shell:
#   python -m xformers.info
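The tensor layout is a common source of bugs: `memory_efficient_attention` takes `(batch, seq_len, num_heads, head_dim)`, while PyTorch's `F.scaled_dot_product_attention` takes `(batch, num_heads, seq_len, head_dim)`. A pure-PyTorch sketch (no xFormers required) of the equivalent reference computation, useful for validating xFormers outputs on small inputs:

```python
import math
import torch
import torch.nn.functional as F

B, M, H, K = 2, 16, 4, 8  # batch, seq_len, num_heads, head_dim
q = torch.randn(B, M, H, K)
k = torch.randn(B, M, H, K)
v = torch.randn(B, M, H, K)

# Permute to (B, H, M, K) for PyTorch's SDPA, then back to (B, M, H, K),
# which is the layout memory_efficient_attention uses.
ref = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

# Manual softmax attention for comparison
qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))
scores = qt @ kt.transpose(-2, -1) / math.sqrt(K)
manual = (torch.softmax(scores, dim=-1) @ vt).transpose(1, 2)
print(torch.allclose(ref, manual, atol=1e-5))  # True
```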