{"id":7801,"library":"transformer-engine-cu12","title":"Transformer Engine (CUDA 12)","description":"Transformer Engine (TE) is a Python library by NVIDIA for accelerating Transformer models on NVIDIA GPUs. It enables lower precision training and inference, notably supporting 8-bit (FP8) and 4-bit (NVFP4) floating point precision on Hopper, Ada, and Blackwell GPUs, leading to better performance and reduced memory utilization. It provides highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API for PyTorch and JAX. The current version is 2.13.0, with an active release cadence, often aligning with new NVIDIA hardware and software advancements.","status":"active","version":"2.13.0","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/TransformerEngine","tags":["AI/ML","Deep Learning","Transformers","NVIDIA","CUDA","Performance","Mixed Precision","FP8","FP4","PyTorch","JAX"],"install":[{"cmd":"pip install --no-build-isolation transformer-engine-cu12[pytorch]","lang":"bash","label":"PyTorch (recommended)"},{"cmd":"pip install --no-build-isolation transformer-engine-cu12[jax]","lang":"bash","label":"JAX"},{"cmd":"pip install --no-build-isolation transformer-engine-cu12[core]","lang":"bash","label":"Core library only"}],"dependencies":[{"reason":"Required for the library to run.","package":"python","version":">=3.10.0","optional":false},{"reason":"Required for PyTorch integration. Installed as an extra dependency.","package":"pytorch","optional":true},{"reason":"Required for JAX integration. 
Installed as an extra dependency.","package":"jax","optional":true},{"reason":"NVIDIA CUDA Toolkit 12.1+ (12.8+ for Blackwell GPUs) with compatible NVIDIA drivers.","package":"CUDA","version":">=12.1","optional":false},{"reason":"cuDNN 9.3+ is required for optimal performance.","package":"cuDNN","version":">=9.3","optional":false}],"imports":[{"symbol":"Linear","correct":"from transformer_engine.pytorch import Linear"},{"symbol":"LayerNorm","correct":"from transformer_engine.pytorch import LayerNorm"},{"symbol":"TransformerLayer","correct":"from transformer_engine.pytorch import TransformerLayer"},{"note":"The `fp8_autocast` context manager is framework-specific and located within the `pytorch` or `jax` submodule.","wrong":"from transformer_engine.fp8 import fp8_autocast","symbol":"fp8_autocast","correct":"from transformer_engine.pytorch.fp8 import fp8_autocast"}],"quickstart":{"code":"import torch\nfrom transformer_engine.pytorch import Linear, fp8_autocast\n\n# Dummy input tensor (FP8 GEMMs need dimensions divisible by 16)\ninput_tensor = torch.randn(16, 128, device='cuda', dtype=torch.float16)\n\n# Initialize a Transformer Engine Linear layer (TE uses params_dtype, not dtype)\nte_linear_layer = Linear(128, 256, bias=True, params_dtype=torch.float16, device='cuda')\n\n# Perform a forward pass with FP8 autocasting\nwith fp8_autocast():\n    output_tensor = te_linear_layer(input_tensor)\n\nprint(f\"Input shape: {input_tensor.shape}, dtype: {input_tensor.dtype}\")\nprint(f\"Output shape: {output_tensor.shape}, dtype: {output_tensor.dtype}\")\n# FP8 is applied internally to the GEMM; inputs and outputs stay in the module's dtype\nassert output_tensor.dtype == torch.float16, \"Output remains in the higher-precision module dtype.\"\nprint(\"Quickstart example ran successfully with FP8 autocasting.\")","lang":"python","description":"This quickstart demonstrates how to use `transformer_engine.pytorch.Linear` with FP8 autocasting. Ensure you have PyTorch, a compatible CUDA environment, and an FP8-capable GPU (Hopper, Ada, or Blackwell). 
The `fp8_autocast` context manager automatically handles FP8 quantization for supported operations within its scope."},"warnings":[{"fix":"Review the Transformer Engine 2.2 release notes for `InferenceParams` and `DelayedScaling` API updates. Adjust code to use new required arguments and method calls.","message":"Breaking changes in `InferenceParams` and removal of the `interval` argument for `DelayedScaling` in PyTorch. `num_heads_kv`, `head_dim_k`, and `dtype` are now required for `InferenceParams` initialization, and `pre_step` must be called.","severity":"breaking","affected_versions":">=2.2.0"},{"fix":"Update C++ code to use the non-packed fused attention APIs. Consult Transformer Engine's C++ API documentation for the correct alternatives.","message":"The deprecated packed fused attention C APIs (`nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}`) have been removed. Users must migrate to the non-packed API variants.","severity":"breaking","affected_versions":">=2.13.0"},{"fix":"Always include `--no-build-isolation` in your `pip install` commands for Transformer Engine to ensure future compatibility and prevent potential build issues. E.g., `pip install --no-build-isolation transformer-engine-cu12[pytorch]`.","message":"The installation of Transformer Engine now requires the `--no-build-isolation` flag when using PyPI or building from source. Support for installations *with* build isolation will be removed in a future release.","severity":"deprecated","affected_versions":">=2.3.0"},{"fix":"Ensure PyTorch and Transformer Engine are built with the same C++ ABI. Rebuilding PyTorch from source with a matching ABI might be necessary, or use NVIDIA NGC Docker containers where these dependencies are pre-configured.","message":"ABI compatibility issues can arise if PyTorch and Transformer Engine are built with different C++ ABI settings, especially outside of NGC containers. 
This manifests as an `ImportError` with undefined symbols.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your CUDA environment is 12.8 or newer when installing from PyPI. If you must use an older CUDA 12.x version (e.g., 12.1), consider installing Transformer Engine from source and explicitly managing CUDA paths during the build process, or use an NGC container.","message":"Installing `transformer-engine-cu12` via PyPI may crash in environments with a CUDA version below 12.8, even though the `cu12` suffix suggests support for any CUDA 12.x release.","severity":"gotcha","affected_versions":">=2.2.0"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install with the appropriate extra dependency: `pip install --no-build-isolation transformer-engine-cu12[pytorch]` (for PyTorch) or `pip install --no-build-isolation transformer-engine-cu12[jax]` (for JAX).","cause":"The framework-specific bindings for PyTorch (or JAX) were not installed. Installing `transformer-engine-cu12` by itself only provides the core library, not the Python bindings for deep learning frameworks.","error":"ModuleNotFoundError: No module named 'transformer_engine.pytorch'"},{"fix":"Verify that both PyTorch and Transformer Engine are built with compatible C++ ABIs. The simplest solution is often to use the NVIDIA NGC PyTorch or JAX Docker containers, which come pre-configured with compatible versions. If installing from source, ensure consistent compiler flags.","cause":"This error typically indicates a C++ ABI incompatibility between PyTorch and Transformer Engine: the two were compiled with different C++ standards or compiler settings.","error":"ImportError: undefined symbol: _ZN3c106cuda9SetDeviceEi"},{"fix":"Install cuDNN 9.3+ and ensure that environment variables like `CUDNN_PATH`, `CUDNN_HOME`, and `LD_LIBRARY_PATH` correctly point to your cuDNN installation. 
For example: `export CUDNN_PATH=/path/to/cudnn`, `export CUDNN_HOME=$CUDNN_PATH`, `export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$LD_LIBRARY_PATH`.","cause":"The cuDNN headers are not found by the build system during installation. This can happen if cuDNN is not installed or if its path is not exposed to the build environment.","error":"fatal error: cudnn.h: No such file or directory"},{"fix":"First, ensure the CUDA Toolkit (12.1+), NVIDIA drivers, and cuDNN (9.3+) are correctly installed and configured. Check that `nvcc` is on your `PATH` or that the `CUDA_PATH` environment variable is set (e.g., `export CUDA_PATH=/usr/local/cuda`). If the error persists, especially when building FlashAttention, try `export MAX_JOBS=1` before installation to reduce memory usage during compilation: `MAX_JOBS=1 pip install --no-build-isolation transformer-engine-cu12[pytorch]`.","cause":"This generic error during `pip install` often masks underlying issues with CMake, the CUDA Toolkit, the `nvcc` path, or the resource-intensive FlashAttention build.","error":"ERROR: Failed building wheel for transformer-engine"}]}