Transformer Engine (CUDA 12)

2.13.0 · active · verified Thu Apr 16

Transformer Engine (TE) is a Python library by NVIDIA for accelerating Transformer models on NVIDIA GPUs. It enables lower-precision training and inference, notably 8-bit floating point (FP8) on Hopper, Ada, and Blackwell GPUs, and 4-bit floating point (NVFP4) on Blackwell, improving throughput and reducing memory use. It provides highly optimized building blocks for popular Transformer architectures and an automatic-mixed-precision-style API for PyTorch and JAX. The current version is 2.13.0, with an active release cadence that often tracks new NVIDIA hardware and software.

Common errors

Warnings

Install
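This section is empty in the card; as a sketch following NVIDIA's published pip instructions (assumes a CUDA 12 toolchain and PyTorch are already installed):

```shell
# Sketch, assuming CUDA 12 and PyTorch are already set up.
# The [pytorch] extra selects the PyTorch integration;
# --no-build-isolation lets the build see the installed torch package.
pip install --no-build-isolation transformer_engine[pytorch]
```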

Imports
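This section is empty in the card; the imports below follow the conventions used in NVIDIA's own examples (the `te` alias is a common convention, not a requirement):

```python
import torch

# Conventional alias for the PyTorch integration.
import transformer_engine.pytorch as te

# FP8 recipe classes (scaling strategy, FP8 format selection).
from transformer_engine.common.recipe import Format, DelayedScaling
```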

Quickstart

This quickstart demonstrates how to use `transformer_engine.pytorch.Linear` with FP8 autocasting. Ensure you have PyTorch and a compatible CUDA environment set up. The `fp8_autocast` context manager handles FP8 quantization for supported operations within its scope; FP8 is used internally for the matrix multiply, while tensors entering and leaving the layer remain in higher precision.

import torch
from transformer_engine.pytorch import Linear, fp8_autocast

# Dummy input tensor
input_tensor = torch.randn(16, 128, device='cuda', dtype=torch.float16)

# Initialize a Transformer Engine Linear layer
# Initialize a Transformer Engine Linear layer (note: the parameter
# dtype is set via params_dtype, not dtype)
te_linear_layer = Linear(128, 256, bias=True, params_dtype=torch.float16).cuda()

# Perform a forward pass with FP8 autocasting
with fp8_autocast(enabled=True):
    output_tensor = te_linear_layer(input_tensor)

print(f"Input shape: {input_tensor.shape}, dtype: {input_tensor.dtype}")
print(f"Output shape: {output_tensor.shape}, dtype: {output_tensor.dtype}")
# FP8 is used only inside the layer; the output comes back in the
# layer's higher-precision params dtype, not as an FP8 tensor.
assert output_tensor.dtype == torch.float16, "Output stays in the params dtype (float16)."
print("Quickstart example ran successfully with FP8 autocasting.")
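When no recipe is given, `fp8_autocast` falls back to a default delayed-scaling recipe. A sketch of passing one explicitly, continuing from the layer and input above (the `amax_history_len` and `amax_compute_algo` values here are illustrative, not tuned):

```python
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID uses E4M3 for forward tensors and E5M2 for gradients.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,        # illustrative; the default is larger
    amax_compute_algo="max",    # take the max over the amax history
)

with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output_tensor = te_linear_layer(input_tensor)
```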
