Transformer Engine

2.13.0 · active · verified Thu Apr 16

Transformer Engine (TE) is an NVIDIA library for accelerating Transformer models on NVIDIA GPUs. It enables 8-bit floating point (FP8) precision on the Hopper, Ada, and Blackwell architectures, and 4-bit floating point (NVFP4) precision on Blackwell, significantly improving throughput and reducing memory use during both training and inference. TE provides highly optimized building blocks for common Transformer architectures and an automatic-mixed-precision-style API that integrates with PyTorch and JAX. Releases are frequent and typically track updates to NVIDIA's deep learning software stack.

Install
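
The PyTorch build is typically installed from PyPI with the `pytorch` extra (wheel availability and build requirements vary by release and CUDA version, so check NVIDIA's install documentation for your platform):

```shell
pip install transformer_engine[pytorch]
```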

Imports

import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch import fp8_autocast
from transformer_engine.common import recipe

Quickstart

This quickstart demonstrates how to initialize `transformer_engine.pytorch.Linear` and `transformer_engine.pytorch.TransformerLayer` modules and perform a forward pass under `fp8_autocast` for 8-bit floating point execution. It uses `torch.bfloat16` as the base parameter precision and includes a basic FP8 recipe configuration. Ensure you have an NVIDIA GPU with CUDA installed; FP8 execution additionally requires a Hopper, Ada, or Blackwell GPU (compute capability 8.9 or higher).

import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch import fp8_autocast
from transformer_engine.common import recipe

# Check for GPU availability
if not torch.cuda.is_available():
    print("CUDA not available. Transformer Engine requires an NVIDIA GPU.")
    exit()

# Define model dimensions
in_features = 1024
out_features = 2048
batch_size = 16
sequence_length = 128

# Create a sample input tensor
input_tensor = torch.randn(batch_size, sequence_length, in_features, device='cuda', dtype=torch.bfloat16)

# Initialize a Transformer Engine Linear layer
te_linear = te.Linear(in_features, out_features, bias=True, params_dtype=torch.bfloat16).cuda()

# Define an FP8 recipe (optional; fine-grained control over scaling behavior).
# Format.HYBRID (E4M3 forward, E5M2 backward) is the usual choice for training.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID, amax_history_len=1024)

print(f"Input tensor shape: {input_tensor.shape}, dtype: {input_tensor.dtype}")

# Perform a forward pass with FP8 autocasting
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output_tensor = te_linear(input_tensor)

print(f"Output tensor shape (with FP8 autocast): {output_tensor.shape}, dtype: {output_tensor.dtype}")

# Example of using a TransformerLayer
num_heads = 16
hidden_size = in_features
ffn_hidden_size = out_features

# TransformerLayer bundles LayerNorm, self-attention, and the MLP into one module
te_transformer_layer = te.TransformerLayer(
    hidden_size=hidden_size,
    ffn_hidden_size=ffn_hidden_size,
    num_attention_heads=num_heads,
    fuse_qkv_params=True, # Common optimization
    params_dtype=torch.bfloat16
).cuda()

with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    # By default TransformerLayer expects (sequence_length, batch_size, hidden_size),
    # so move the sequence dimension first and make the tensor contiguous
    output_transformer_layer = te_transformer_layer(input_tensor.transpose(0, 1).contiguous())

print(f"Output from TransformerLayer (with FP8 autocast): {output_transformer_layer.shape}, dtype: {output_transformer_layer.dtype}")
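
The DelayedScaling recipe used above exists because FP8 has a very narrow dynamic range, so each tensor must be multiplied by a scaling factor before the cast. A framework-free sketch of the two FP8 formats' largest finite values and of a scale derived from an observed abs-max (the constants follow the standard E4M3/E5M2 encodings; `fp8_scale` is an illustrative helper, not a TE API):

```python
# Maximum normal values of the FP8 formats used by Transformer Engine.
# E4M3 (4 exponent / 3 mantissa bits, bias 7): the top code is reserved for
# NaN only, so the largest finite value is (1 + 6/8) * 2**8 = 448.
e4m3_max = (1 + 6 / 8) * 2 ** 8       # 448.0
# E5M2 (5 exponent / 2 mantissa bits, bias 15) is IEEE-like, reserving the
# top exponent code for inf/NaN: largest finite value is (1 + 3/4) * 2**15.
e5m2_max = (1 + 3 / 4) * 2 ** 15      # 57344.0

def fp8_scale(amax: float, fp8_max: float = e4m3_max, margin: int = 0) -> float:
    """Scaling factor that maps an observed abs-max onto the FP8 range,
    mirroring the margin term in DelayedScaling (illustrative only)."""
    return (fp8_max / amax) / (2 ** margin)

# A bf16 activation whose running abs-max is 12000 overflows E4M3 (max 448)
# unless multiplied by a scale before the cast:
scale = fp8_scale(12000.0)
print(e4m3_max, e5m2_max, scale * 12000.0)  # 448.0 57344.0 448.0
```

DelayedScaling tracks an amax history per tensor (here, `amax_history_len=1024` steps) so the scale adapts to changing activation statistics without recomputing it from the current tensor on every step.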
