{"id":8727,"library":"transformer-engine","title":"Transformer Engine","description":"Transformer Engine (TE) is a library developed by NVIDIA for accelerating Transformer models on NVIDIA GPUs. It enables 8-bit floating point (FP8) precision on the Hopper, Ada, and Blackwell architectures, as well as 4-bit floating point (NVFP4) precision on Blackwell, significantly improving performance and reducing memory utilization during both training and inference. TE provides highly optimized building blocks for common Transformer architectures and an automatic-mixed-precision-style API that integrates seamlessly with PyTorch and JAX. The library has frequent releases, often aligned with updates to NVIDIA's deep learning software stack.","status":"active","version":"2.13.0","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/TransformerEngine","tags":["pytorch","jax","gpu","transformer","deep-learning","fp8","mixed-precision","nvidia","acceleration"],"install":[{"cmd":"pip install transformer-engine","lang":"bash","label":"PyPI (requires CUDA and NVIDIA GPU)"}],"dependencies":[{"reason":"Primary deep learning framework integration for the PyTorch API.","package":"torch","optional":true},{"reason":"Deep learning framework integration for the JAX API.","package":"jax","optional":true},{"reason":"Requires the NVIDIA CUDA Toolkit for GPU acceleration.","package":"cuda","optional":false},{"reason":"Optional dependency for FlashAttention integration to further improve performance.","package":"flash-attn","optional":true}],"imports":[{"note":"Commonly used for linear layers, replacing torch.nn.Linear with a TE-optimized implementation.","symbol":"te.Linear","correct":"import transformer_engine.pytorch as te\nlinear_layer = te.Linear(in_features, out_features)"},{"note":"Optimized LayerNorm implementation from Transformer Engine.","symbol":"te.LayerNorm","correct":"import transformer_engine.pytorch as te\nlayer_norm = te.LayerNorm(normalized_shape)"},{"note":"Context manager to enable FP8 precision 
for forward passes within Transformer Engine modules; it should wrap the forward pass only.","symbol":"fp8_autocast","correct":"from transformer_engine.pytorch import fp8_autocast\nwith fp8_autocast():\n    output = model(input)"},{"note":"Ready-to-use module for a complete Transformer layer, replacing multiple PyTorch modules with fused, optimized versions.","symbol":"te.TransformerLayer","correct":"import transformer_engine.pytorch as te\ntransformer_block = te.TransformerLayer(...)"}],"quickstart":{"code":"import torch\nimport transformer_engine.pytorch as te\nfrom transformer_engine.pytorch import fp8_autocast\nfrom transformer_engine.common import recipe\n\n# Check for GPU availability\nif not torch.cuda.is_available():\n    print(\"CUDA not available. Transformer Engine requires an NVIDIA GPU.\")\n    exit()\n\n# Define model dimensions\nin_features = 1024\nout_features = 2048\nbatch_size = 16\nsequence_length = 128\n\n# Create a sample input tensor\ninput_tensor = torch.randn(batch_size, sequence_length, in_features, device='cuda', dtype=torch.bfloat16)\n\n# Initialize a Transformer Engine Linear layer\n# (TE modules take params_dtype rather than dtype)\nte_linear = te.Linear(in_features, out_features, bias=True, params_dtype=torch.bfloat16).cuda()\n\n# Define an FP8 recipe (optional, for fine-grained control)\nfp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3, amax_history_len=1024)\n\nprint(f\"Input tensor shape: {input_tensor.shape}, dtype: {input_tensor.dtype}\")\n\n# Perform a forward pass with FP8 autocasting\nwith fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):\n    output_tensor = te_linear(input_tensor)\n\nprint(f\"Output tensor shape (with FP8 autocast): {output_tensor.shape}, dtype: {output_tensor.dtype}\")\n\n# Example of using a TransformerLayer\nnum_heads = 16\nhidden_size = in_features\nffn_hidden_size = out_features\n\n# TransformerLayer bundles LayerNorm, attention, and the MLP into one fused module\nte_transformer_layer = te.TransformerLayer(\n    hidden_size=hidden_size,\n    
ffn_hidden_size=ffn_hidden_size,\n    num_attention_heads=num_heads,\n    fuse_qkv_params=True, # Common optimization\n    params_dtype=torch.bfloat16\n).cuda()\n\nwith fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):\n    # TransformerLayer expects (sequence_length, batch_size, hidden_size) by default,\n    # so transpose the batch-first input\n    output_transformer_layer = te_transformer_layer(input_tensor.transpose(0, 1))\n\nprint(f\"Output from TransformerLayer (with FP8 autocast): {output_transformer_layer.shape}, dtype: {output_transformer_layer.dtype}\")","lang":"python","description":"This quickstart demonstrates how to initialize `transformer_engine.pytorch.Linear` and `transformer_engine.pytorch.TransformerLayer` modules and perform a forward pass using `fp8_autocast` for 8-bit floating point precision. It uses `torch.bfloat16` as the base precision and includes a basic FP8 recipe configuration. An NVIDIA GPU with CUDA installed is required."},"warnings":[{"fix":"Update C++ code or custom integrations to use the non-packed fused attention C APIs. Refer to the v2.13 release notes for specific migration details.","message":"Transformer Engine v2.13 removed the deprecated packed fused attention C APIs (nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}). Users must migrate to the non-packed API variants.","severity":"breaking","affected_versions":">=2.13.0"},{"fix":"Review and invert the boolean logic for padding masks in PyTorch code written for versions prior to v1.7: `True` now *excludes* positions.","message":"Transformer Engine v1.7 changed the padding mask definition for PyTorch: `True` now means masking out the corresponding position, while `False` means including it. 
This unifies mask definitions across supported frameworks.","severity":"breaking","affected_versions":">=1.7.0"},{"fix":"Update `InferenceParams` initialization and usage according to the v2.2 release notes, ensuring all new required arguments are provided and `pre_step` is called. Replace `swap_key_value_dict` usage with the new automatic reordering in `step`.","message":"Transformer Engine v2.2 introduced multiple breaking changes in the `InferenceParams` class, requiring new arguments (`num_heads_kv`, `head_dim_k`, `dtype`) during initialization and requiring a call to `pre_step` to update the state. The `swap_key_value_dict` method was also removed.","severity":"breaking","affected_versions":">=2.2.0"},{"fix":"Avoid CPU offloading for weight tensors. Install with `pip install --no-build-isolation transformer-engine`, since support for installs without this flag will be removed in a future release.","message":"Transformer Engine v2.3 deprecated CPU offloading of weight tensors. Support for installations *without* the `--no-build-isolation` flag will also be removed in a future release.","severity":"deprecated","affected_versions":">=2.3.0"},{"fix":"For optimal FP8 performance, use Transformer Engine with larger models and batch sizes, where the computational benefits outweigh casting and CPU overheads. Ensure GPU compute can hide CPU overheads by avoiding frequent GPU synchronization.","message":"FP8 execution may be slower than FP16/BF16 for small models or batch sizes due to the overhead of FP8 casts and the increased CPU overhead of `te.Linear`'s additional logic compared to `torch.nn.Linear`.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure your CUDA installation is 12.8 or newer. 
If not, install `transformer-engine` from source as a temporary workaround until this issue is fully addressed in later releases.","message":"A `ModuleNotFoundError` may occur if `transformer-engine` is installed via PyPI in an environment with a CUDA version older than 12.8.","severity":"gotcha","affected_versions":"<2.3.0"},{"fix":"Be aware that FlashAttention may not be used for cross-attention with causal masking. If this specific scenario is critical for performance, consider alternative attention implementations or check your installed FlashAttention version.","message":"FlashAttention v2.1 and later changed the behavior of the causal mask when performing cross-attention. To maintain consistent behavior across Transformer Engine versions and backends, FlashAttention is *disabled* for this specific use case (cross-attention with causal masking) when v2.1+ is installed.","severity":"gotcha","affected_versions":">=1.2.1"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure you have an NVIDIA GPU and CUDA installed, then try `pip install transformer-engine`. If issues persist, consider using NVIDIA NGC PyTorch Docker images, which come with Transformer Engine pre-installed and optimized.","cause":"The `transformer-engine` library is not correctly installed or not accessible in the current Python environment, or the required CUDA/GPU setup is missing.","error":"ModuleNotFoundError: No module named 'transformer_engine'"},{"fix":"Run with larger models and batch sizes to amortize the FP8 overhead. Profile your application to identify CPU bottlenecks and ensure that GPU compute operations can effectively hide CPU overheads.","cause":"This often occurs when using small models or batch sizes, where the overhead of FP8 casting and `te.Linear`'s additional logic outweighs the performance benefits. 
CPU overhead can also be a factor.","error":"Transformer Engine FP8 Linear Functions are slower than PyTorch's built-in linear API."},{"fix":"Upgrade to Transformer Engine v2.13 or later, which includes a fix for this build issue. Alternatively, ensure NCCL is installed in a manner compatible with the build system.","cause":"An incompatibility during the build process when NCCL is installed in a specific way from PyPI, affecting how build tools locate necessary files.","error":"TypeError during build when NCCL is installed from PyPI as a namespace package without a __file__ attribute."},{"fix":"Update your CUDA installation to version 12.8 or higher. If upgrading CUDA is not feasible, install Transformer Engine from source to build it against your specific CUDA version.","cause":"Compatibility issues between the PyPI-distributed Transformer Engine binary and older CUDA versions.","error":"Transformer Engine may crash when it is installed via the PyPI registry but is run in an environment with CUDA version < 12.8."}]}