SageAttention
SageAttention is a Python library providing accurate and efficient 8-bit plug-and-play attention mechanisms, including Mixture-of-Experts (MoE) implementations. It aims to accelerate large language models with minimal performance drop. The current bleeding-edge version is 2.0.1, though the PyPI package might lag behind GitHub releases. Releases typically occur when major architectural changes or significant features are implemented.
Common errors
- `ModuleNotFoundError: No module named 'sageattention.sagemoe'`
  - cause: The `sageattention` library, or the specific sub-module, is not installed or not accessible in your current Python environment.
  - fix: Install the library in the active environment: `pip install sageattention` (stable PyPI) or `pip install git+https://github.com/thu-ml/SageAttention.git` (latest GitHub). Verify that your Python interpreter points to the correct environment.
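To see which interpreter is active and whether `sageattention` is importable from it, a quick stdlib-only diagnostic (no SageAttention API assumed) can help:

```python
import importlib.util
import sys

# Print the interpreter in use, then probe for sageattention without
# importing it (find_spec returns None when the module is absent).
print(f"Interpreter: {sys.executable}")
spec = importlib.util.find_spec("sageattention")
if spec is None:
    print("sageattention is NOT importable from this environment")
else:
    print(f"sageattention found at: {spec.origin}")
```

If the printed interpreter is not the one your virtual environment provides, activate the environment (or call `pip` as `python -m pip`) before installing.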
- `RuntimeError: The size of tensor a (X) must match the size of tensor b (Y) at non-singleton dimension Z`
  - cause: Input tensors passed to SageAttention modules (e.g., `SageMoE`, `TransformerBlock`) do not have the expected dimensions, or the `dim` parameter does not match the input's last dimension.
  - fix: Verify that the `dim` parameter used at module initialization matches the last dimension of your input tensor, and keep batch and sequence dimensions consistent across inputs. For example, if `dim=512`, your input should typically have shape `(batch, seq_len, 512)`.
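A cheap guard before calling the module turns this deep `RuntimeError` into an immediate, readable failure. A minimal sketch; the `check_last_dim` helper is illustrative, not part of the library:

```python
def check_last_dim(shape, dim):
    """Fail fast if the input's last dimension doesn't match the module's dim."""
    if shape[-1] != dim:
        raise ValueError(
            f"input last dimension is {shape[-1]}, but the module was "
            f"initialized with dim={dim}"
        )

check_last_dim((1, 10, 512), 512)  # OK: matches dim=512
# check_last_dim((1, 10, 256), 512) would raise ValueError
```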
- `TypeError: __init__() got an unexpected keyword argument 'some_old_argument_name'`
  - cause: You are passing constructor arguments from a previous major version's API (e.g., pre-2.0.0) to a newer installed version of SageAttention.
  - fix: This error usually signals a breaking API change. Consult the latest GitHub README or the module's source code for the correct constructor arguments for your installed version, and adapt your initialization code accordingly.
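Rather than guessing at renamed arguments, you can print the installed constructor's signature. Demonstrated here on a stand-in class (`DemoBlock` is a placeholder; apply `inspect.signature` to the actual imported SageAttention class):

```python
import inspect

# Stand-in for whichever class raised the TypeError.
class DemoBlock:
    def __init__(self, dim, heads=8, dim_head=64):
        self.dim = dim

# List the keyword arguments the installed version actually accepts.
params = list(inspect.signature(DemoBlock.__init__).parameters)
print(params)  # ['self', 'dim', 'heads', 'dim_head']
```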
Warnings
- breaking Major architectural changes were introduced in v2.0.0. This update added support for MoE and new models, significantly altering internal structures and potentially public APIs for lower-level components.
- gotcha The PyPI package version (currently 1.0.6) significantly lags behind the latest GitHub releases (currently 2.0.1). This means `pip install sageattention` might not give you the features or fixes shown in the latest documentation or GitHub issues.
- gotcha Incorrect tensor shapes are a common source of runtime errors when working with attention and MoE modules.
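Given the PyPI/GitHub version gap above, it is worth confirming which version `pip` actually installed. `importlib.metadata` does this without importing the package:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Compare this against the latest GitHub tag before assuming a feature exists.
print(installed_version("sageattention"))
```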
Install
- Stable (PyPI): `pip install sageattention`
- Latest (GitHub): `pip install git+https://github.com/thu-ml/SageAttention.git`
Imports
- SageMoE: `from sageattention.sagemoe.moe_layer import SageMoE`
- TransformerBlock: `from sageattention.sagemoe.transformer_block import TransformerBlock`
- SageAttentionForCausalLM: `from sageattention.models import SageAttentionForCausalLM`
Quickstart
```python
import torch
from sageattention.sagemoe.moe_layer import SageMoE
from sageattention.sagemoe.transformer_block import TransformerBlock

# Example: SageMoE
# Initialize a Mixture-of-Experts layer
moe_model = SageMoE(dim=512, num_experts=8, top_k=2)

# Dummy input: (batch_size, sequence_length, embedding_dimension)
x_moe = torch.randn(1, 10, 512)

# Pass input through the MoE layer
output_moe = moe_model(x_moe)
print(f"SageMoE Output Shape: {output_moe.shape}")

# Example: TransformerBlock
# Initialize a Transformer block combining attention and MoE
transformer_block = TransformerBlock(
    dim=512, heads=8, dim_head=64, ff_mult=4, num_experts=8, top_k=2
)
x_transformer = torch.randn(1, 10, 512)

# Pass input through the Transformer block
output_transformer = transformer_block(x_transformer)
print(f"TransformerBlock Output Shape: {output_transformer.shape}")
```