Transformer Engine
Transformer Engine (TE) is a library developed by NVIDIA for accelerating Transformer models on NVIDIA GPUs. It enables 8-bit floating point (FP8) precision on the Hopper, Ada, and Blackwell architectures, and 4-bit floating point (NVFP4) precision on Blackwell, significantly improving performance and reducing memory utilization during both training and inference. TE provides highly optimized building blocks for common Transformer architectures and an automatic-mixed-precision-style API that integrates seamlessly with PyTorch and JAX. The library is released frequently, often in step with updates to NVIDIA's deep learning software stack.
Common errors
- ModuleNotFoundError: No module named 'transformer_engine'
  - cause: The `transformer-engine` library is not installed or not accessible in the current Python environment, or the required CUDA/GPU setup is missing.
  - fix: Ensure you have an NVIDIA GPU and CUDA installed, then try `pip install transformer-engine`. If issues persist, consider the NVIDIA NGC PyTorch Docker images, which ship with Transformer Engine pre-installed and optimized.
- Transformer Engine FP8 linear functions are slower than PyTorch's built-in linear API.
  - cause: This often occurs with small models or batch sizes, where the overhead of FP8 casting and `te.Linear`'s additional logic outweighs the performance benefits. CPU overhead can also be a factor.
  - fix: Run with larger models and batch sizes to amortize the FP8 overhead. Profile your application to identify CPU bottlenecks and ensure that GPU compute can effectively hide them.
- TypeError during build when NCCL is installed from PyPI as a namespace package without a `__file__` attribute.
  - cause: An incompatibility during the build process when NCCL is installed this way from PyPI, affecting how the build tools locate required files.
  - fix: Upgrade to Transformer Engine v2.13 or later, which fixes this build issue. Alternatively, install NCCL in a manner compatible with the build system.
- Transformer Engine may crash when installed from PyPI but run in an environment with CUDA version < 12.8.
  - cause: Compatibility issues between the PyPI-distributed Transformer Engine binaries and older CUDA versions.
  - fix: Update CUDA to version 12.8 or higher. If upgrading is not feasible, build Transformer Engine from source against your installed CUDA version.
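The CUDA-version incompatibility above can be caught early with a preflight check before importing the library. A minimal, pure-Python sketch (the `12.8` floor comes from the note above; how you obtain the version string, e.g. from `torch.version.cuda`, is up to your environment):

```python
def cuda_version_ok(version_str, minimum=(12, 8)):
    """Return True if a CUDA version string like '12.8' meets the minimum.

    Compares only the major and minor components as an integer tuple.
    """
    parts = tuple(int(p) for p in version_str.split(".")[:2])
    return parts >= minimum

# Illustrative checks against the 12.8 floor mentioned above
print(cuda_version_ok("12.8"))  # True
print(cuda_version_ok("12.4"))  # False
```

In a real setup you would call this with the runtime CUDA version (for instance `torch.version.cuda`) and fall back to a source build if it returns `False`.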
Warnings
- breaking Transformer Engine v2.13 removed deprecated packed fused attention C APIs (nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}). Users must migrate to the non-packed API variants.
- breaking In Transformer Engine v1.7, the padding mask definition for PyTorch changed. `True` now means masking out the corresponding position, while `False` means including it. This unifies mask definitions across supported frameworks.
- breaking Transformer Engine v2.2 introduced multiple breaking changes in the `InferenceParams` class, requiring new arguments (`num_heads_kv`, `head_dim_k`, `dtype`) during initialization and requiring a call to `pre_step` to update the state. The `swap_key_value_dict` method was also removed.
- deprecated Transformer Engine v2.3 deprecated CPU offloading of weight tensors. Support for installing *without* the `--no-build-isolation` flag will also be removed in a future release.
- gotcha FP8 execution might be slower than FP16/BF16 for small models or batch sizes due to overheads from FP8 casts and increased CPU overhead from `te.Linear`'s additional logic compared to `torch.nn.Linear`.
- gotcha `ModuleNotFoundError` may occur if `transformer-engine` is installed via PyPI in an environment with CUDA version less than 12.8.
- gotcha FlashAttention v2.1 and later changed the behavior of the causal mask when performing cross-attention. To maintain consistent behavior across Transformer Engine versions and backends, FlashAttention is *disabled* for this specific use case (cross-attention with causal masking) when v2.1+ is installed.
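The v1.7 padding-mask change above means boolean masks written for the old convention must be inverted. A schematic illustration in plain Python (real code would invert a `torch.bool` tensor with `~mask` instead of list comprehensions):

```python
# Old convention: True = attend to this position.
# New convention (TE >= 1.7): True = mask this position out.
old_mask = [[True, True, False],
            [True, False, False]]

# Invert every entry to migrate an old-style mask to the new convention
new_mask = [[not keep for keep in row] for row in old_mask]

print(new_mask)  # [[False, False, True], [False, True, True]]
```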
Install
- pip install transformer-engine
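For installs outside the NGC containers, the TE documentation recommends disabling build isolation so the build can see your installed torch and CUDA toolchain (the `[pytorch]` extra selects the PyTorch bindings); treat this as a sketch and check the official install guide for your CUDA version:

```shell
# Install Transformer Engine with the PyTorch extension.
# --no-build-isolation lets the build reuse the already-installed
# torch and CUDA toolchain instead of an isolated build environment.
pip install --no-build-isolation transformer_engine[pytorch]
```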
Imports
- te.Linear
  import transformer_engine.pytorch as te
  linear_layer = te.Linear(in_features, out_features)
- te.LayerNorm
  import transformer_engine.pytorch as te
  layer_norm = te.LayerNorm(normalized_shape)
- fp8_autocast
  from transformer_engine.pytorch import fp8_autocast
  with fp8_autocast():
      output = model(input)
- te.TransformerLayer
  import transformer_engine.pytorch as te
  transformer_block = te.TransformerLayer(...)
Quickstart
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch import fp8_autocast
from transformer_engine.common import recipe
# Check for GPU availability
if not torch.cuda.is_available():
    print("CUDA not available. Transformer Engine requires an NVIDIA GPU.")
    exit()
# Define model dimensions
in_features = 1024
out_features = 2048
batch_size = 16
sequence_length = 128
# Create a sample input tensor
input_tensor = torch.randn(batch_size, sequence_length, in_features, device='cuda', dtype=torch.bfloat16)
# Initialize a Transformer Engine Linear layer
te_linear = te.Linear(in_features, out_features, bias=True, dtype=torch.bfloat16).cuda()
# Define an FP8 recipe (optional, for fine-grained control)
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3, amax_history_len=1024)
print(f"Input tensor shape: {input_tensor.shape}, dtype: {input_tensor.dtype}")
# Perform a forward pass with FP8 autocasting
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output_tensor = te_linear(input_tensor)
print(f"Output tensor shape (with FP8 autocast): {output_tensor.shape}, dtype: {output_tensor.dtype}")
# Example of using a TransformerLayer
num_heads = 16
hidden_size = in_features
ffn_hidden_size = out_features
# TransformerLayer requires a specific config
te_transformer_layer = te.TransformerLayer(
    hidden_size=hidden_size,
    ffn_hidden_size=ffn_hidden_size,
    num_attention_heads=num_heads,
    fuse_qkv_params=True,  # fuse QKV projection parameters (common optimization)
    params_dtype=torch.bfloat16,
).cuda()
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    # TransformerLayer expects (sequence_length, batch_size, hidden_size) by default,
    # so transpose the (batch, seq, hidden) input before the call
    output_transformer_layer = te_transformer_layer(input_tensor.transpose(0, 1))
print(f"Output from TransformerLayer (with FP8 autocast): {output_transformer_layer.shape}, dtype: {output_transformer_layer.dtype}")
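The `DelayedScaling` recipe used above derives each tensor's FP8 scaling factor from a rolling history of absolute-maximum (amax) values. A schematic, pure-Python illustration of that idea; the actual TE implementation runs on the GPU, handles zero amax, rolls the history buffer, and supports other amax-reduction algorithms:

```python
E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def fp8_scale(amax_history, margin=0):
    """Scale factor mapping a tensor's observed range into FP8 E4M3.

    Takes the maximum over the amax history, then leaves `margin`
    bits of headroom by halving the scale per margin step.
    """
    amax = max(amax_history)
    return (E4M3_MAX / amax) / (2 ** margin)

# amax values observed over recent iterations (illustrative numbers)
history = [1.5, 3.0, 2.2]
scale = fp8_scale(history)
print(scale)  # 448 / 3.0
```

A tensor is cast to FP8 as `round(x * scale)` and dequantized with the reciprocal, so keeping the history maximum (rather than the latest amax) makes the scale robust to per-step fluctuations.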