SageAttention

2.0.1 · active · verified Fri Apr 17

SageAttention is a Python library providing accurate and efficient 8-bit plug-and-play attention mechanisms, including Mixture-of-Experts (MoE) components. It aims to accelerate large language models with minimal accuracy loss. The latest version is 2.0.1, though the PyPI package may lag behind GitHub releases; releases typically accompany major architectural changes or significant new features.
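To illustrate the core idea of 8-bit attention, here is a minimal, self-contained sketch in plain PyTorch. It is not SageAttention's actual kernel: it quantizes Q and K to INT8 with per-tensor symmetric scales, computes the QK^T scores from the quantized values (simulated in float, since CPU PyTorch has no INT8 matmul kernel), dequantizes before the softmax, and compares the result against full-precision attention. All function names here are illustrative.

```python
import torch
import torch.nn.functional as F

def quantize_int8(t):
    # Per-tensor symmetric INT8 quantization: scale so max |value| maps to 127.
    scale = t.abs().amax() / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_attention(q, k, v):
    # Quantize Q and K to INT8; the QK^T product would run in integer
    # precision on hardware (simulated here by casting back to float),
    # then is dequantized with the two scales before the softmax.
    q_int8, q_scale = quantize_int8(q)
    k_int8, k_scale = quantize_int8(k)
    scores = torch.matmul(q_int8.float(), k_int8.float().transpose(-2, -1))
    scores = scores * (q_scale * k_scale) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)  # the P·V product is kept in full precision here

torch.manual_seed(0)
q = torch.randn(1, 8, 10, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 10, 64)
v = torch.randn(1, 8, 10, 64)

ref = F.scaled_dot_product_attention(q, k, v)
approx = int8_attention(q, k, v)
print("max abs error vs FP32 attention:", (ref - approx).abs().max().item())
```

The quantization error stays small because softmax is insensitive to small perturbations of the scores, which is why 8-bit attention can preserve model quality.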

Quickstart

This quickstart shows how to instantiate and use the core `SageMoE` (Mixture-of-Experts) layer and a `TransformerBlock` that uses SageAttention internally. It creates dummy input tensors and prints the resulting output shapes.

import torch
from sageattention.sagemoe.moe_layer import SageMoE
from sageattention.sagemoe.transformer_block import TransformerBlock

# Example for SageMoE
# Initialize a Mixture-of-Experts layer
moe_model = SageMoE(dim=512, num_experts=8, top_k=2)
# Create a dummy input tensor
x_moe = torch.randn(1, 10, 512) # (batch_size, sequence_length, embedding_dimension)
# Pass input through the MoE layer
output_moe = moe_model(x_moe)
print(f"SageMoE Output Shape: {output_moe.shape}")

# Example for TransformerBlock
# Initialize a Transformer block with attention and MoE
transformer_block = TransformerBlock(dim=512, heads=8, dim_head=64, ff_mult=4, num_experts=8, top_k=2)
# Create a dummy input tensor
x_transformer = torch.randn(1, 10, 512)
# Pass input through the Transformer block
output_transformer = transformer_block(x_transformer)
print(f"TransformerBlock Output Shape: {output_transformer.shape}")
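The quickstart treats `SageMoE` as a black box. The following is a hypothetical sketch of the top-k routing pattern that such a layer typically implements: a learned gate scores every expert per token, the `top_k` highest scores are renormalized with a softmax, and each token's output is the weighted sum of its selected experts' outputs. The class and its internals are illustrative, not SageMoE's actual code; only the `dim`/`num_experts`/`top_k` interface mirrors the quickstart.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Hypothetical sketch of top-k expert routing; not SageMoE's actual code.
    def __init__(self, dim, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)  # router: one score per expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):
        # x: (batch, seq_len, dim)
        logits = self.gate(x)                           # (B, S, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # pick top_k experts per token
        weights = torch.softmax(weights, dim=-1)        # renormalize the selected scores
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real implementations dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # (B, S, top_k) selection mask
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # per-token weight for expert e
                out = out + w * expert(x)
        return out

torch.manual_seed(0)
moe = TopKMoE(dim=512, num_experts=8, top_k=2)
y = moe(torch.randn(1, 10, 512))
print(y.shape)  # torch.Size([1, 10, 512])
```

With `top_k=2` of 8 experts, only a quarter of the expert parameters are active per token, which is the source of MoE's compute savings at a given parameter count.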
