NVIDIA Tools Extension (NVTX) for CUDA 11 (Python Bindings)
NVTX (NVIDIA Tools Extension SDK) is a cross-platform C-based API, with Python, C++, and Rust wrappers, for annotating events, code ranges, and resources within applications. The annotations give NVIDIA developer tools such as Nsight Systems, Nsight Compute, and Nsight Graphics the context they need for performance analysis and visualization. The `nvidia-nvtx-cu11` package provides the NVTX Python bindings compiled specifically for CUDA 11 environments, so that Python applications can use NVTX for profiling. The package is currently at version 11.8.86. The Python API itself is exposed through the `nvtx` module, which also has its own PyPI distribution with a more frequent release cadence.
Warnings
- breaking: NVTX v2 (for example, the legacy `nvToolsExt.h` header) is deprecated in favor of NVTX v3. Using older NVTX versions, or C/C++ code compiled against NVTX v2 headers, alongside newer CUDA toolkits (CUDA 12.x and above) can lead to compilation failures or undefined runtime behavior, especially in custom CUDA extensions. The Python bindings for NVTX generally use NVTX v3.
- gotcha: Profiling Python applications that use the `multiprocessing` module under NVIDIA Nsight Systems with NVTX enabled can make the script hang indefinitely or produce incomplete profiling data on Linux. The usual cause is that the default 'fork' start method is not compatible with the injection mechanism used by Nsight Systems.
- gotcha: The NVTX Python bindings offer automatic function annotation, but enabling it adds significant overhead (potentially more than 10x) to every function call, which can severely degrade application runtime performance.
- gotcha: Creating an NVTX domain is a relatively expensive operation. Create domains sparingly, typically one per library or major application component, rather than using them for fine-grained event grouping.
- gotcha: NVTX is designed for annotating host-side (CPU) events and ranges that interact with GPU operations. Direct annotation inside GPU code (for example, CUDA `__device__` functions) is not supported.
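For the `multiprocessing` warning above, one commonly suggested workaround is to start workers with 'spawn' instead of 'fork'. The sketch below shows only that switch, using the standard library; the helper name `run_with_spawn` is hypothetical, and whether this resolves a particular hang under Nsight Systems depends on the setup, so treat it as a starting point rather than a guaranteed fix.

```python
import math
import multiprocessing as mp


def run_with_spawn(values, processes=2):
    """Map math.sqrt over `values` using workers started with 'spawn'."""
    # get_context() scopes the choice to this pool instead of changing the
    # process-wide default the way mp.set_start_method() would.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(math.sqrt, values)


if __name__ == "__main__":
    # The guard is required with 'spawn': each worker re-imports this module.
    print(run_with_spawn([1.0, 4.0, 9.0]))
```

With 'spawn', worker functions must be importable by name, so lambdas and functions defined inside other functions will not work as pool targets.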
Install
- pip install nvidia-nvtx-cu11
- conda install -c conda-forge nvtx
Imports
- nvtx: `import nvtx`
Quickstart
import time

import nvtx


# Decorator form: the range spans every call to my_function().
@nvtx.annotate(color="blue", message="my_function_range")
def my_function():
    for i in range(2):
        # Context-manager form: one nested range per loop iteration.
        with nvtx.annotate("loop_iteration", color="red", category=i):
            time.sleep(0.05)  # Simulate some work


if __name__ == "__main__":
    my_function()
    print("Execution finished. To profile, run with NVIDIA Nsight Systems: `nsys profile -t nvtx python <your_script_name>.py`")
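
The quickstart can be extended to follow the domain guidance from the warnings: one domain per component, reused for every range, with categories carrying the finer grouping. This is a minimal sketch under stated assumptions. The names `DOMAIN`, `region`, `load`, and `compute` are hypothetical, and the `ImportError` fallback to a no-op context manager is an addition so that the sketch also runs where `nvtx` is not installed.

```python
import contextlib

try:
    import nvtx
except ImportError:  # assumption: keep the sketch runnable without nvtx
    nvtx = None

# Hypothetical name: one domain for the whole component, reused everywhere.
DOMAIN = "my_library"


def region(message, category):
    """Return a range in the shared domain; `category` gives the finer grouping."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(message, domain=DOMAIN, category=category)


def load(n):
    with region("load", category="io"):
        return list(range(n))


def compute(values):
    with region("compute", category="math"):
        return sum(v * v for v in values)


if __name__ == "__main__":
    print(compute(load(4)))
```

Passing the domain as a single shared constant keeps the number of domains at one per component, while the `category` argument separates I/O ranges from compute ranges in the profiler timeline.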