Triton Performance Analyzer
Triton Performance Analyzer (`perf_analyzer`) is a command-line interface (CLI) tool for optimizing the inference performance of models running on the NVIDIA Triton Inference Server. It measures key metrics such as throughput and latency by sending inference requests to your model and repeating measurements until the values stabilize. The tool is currently at version 2.59.1 and follows the release cadence of the broader Triton Inference Server project.
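A minimal sketch of measuring latency and throughput, assuming a server is already running and a model named `my_model` is loaded (the model name and endpoints below are placeholders):

```shell
# Report p95 latency instead of the default average:
perf_analyzer -m my_model --percentile=95

# Target a gRPC endpoint instead of the default HTTP endpoint:
perf_analyzer -m my_model -i grpc -u localhost:8001
```

By default `perf_analyzer` talks HTTP to `localhost:8000`; `-i` selects the protocol and `-u` the server URL.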
Warnings
- deprecated The related tool `genai-perf` is being deprecated. Users should migrate to `AIPerf` for continued support and enhanced features in generative AI model benchmarking.
- gotcha When installing `perf-analyzer` via `pip`, runtime dependencies (e.g., CUDA-related libraries for GPU support, `tritonclient` dependencies) are not automatically managed. Missing dependencies will cause errors during execution.
- gotcha Direct C API mode within `perf_analyzer` has known limitations, including lack of support for asynchronous mode (`-a`), shared memory mode (`--shared-memory`), and request rate range mode.
- gotcha Performance metrics, especially latency, can vary significantly between runs when not using shared memory. Using `--shared-memory=system` or `--shared-memory=cuda` can lead to more stable and representative results by reducing network overhead.
- gotcha Running multiple `perf_analyzer` processes concurrently against a single Triton Inference Server instance can produce unreliable measurements or other unexpected behavior, since the processes contend for the same server resources. Benchmark with one `perf_analyzer` process at a time.
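To reduce the run-to-run variance described above, shared memory can be enabled on the command line. A sketch, assuming a hypothetical model named `my_model`:

```shell
# System (CPU) shared memory for input/output tensors,
# avoiding transmission of tensor data over the network:
perf_analyzer -m my_model --shared-memory=system

# CUDA shared memory, keeping tensor data on the GPU:
perf_analyzer -m my_model --shared-memory=cuda
```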
Install
- pip install perf-analyzer
- docker run --rm --gpus=all -it --net=host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk perf_analyzer -m <model>
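After installing (by either method), a quick sanity check that the binary is available:

```shell
# Print the option list; works for the pip install
# and inside the SDK container alike:
perf_analyzer --help
```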
Quickstart
# Assuming Triton Inference Server is running at localhost:8000 with a model named 'my_model'
# First, ensure Triton is running. Example (simplified):
#   docker pull nvcr.io/nvidia/tritonserver:24.02-py3
#   docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:24.02-py3
#   (Inside container) tritonserver --model-repository /models &
# Run perf_analyzer from a terminal where Triton is accessible
perf_analyzer -m my_model --measurement-interval 5000 --concurrency-range 1:8:2
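The quickstart above sweeps concurrency from 1 to 8 in steps of 2. Two common variations, again assuming the hypothetical model `my_model` (the numeric ranges are placeholders to tune for your workload):

```shell
# Sweep a fixed request rate instead of concurrency
# (start:end:step, in requests per second):
perf_analyzer -m my_model --request-rate-range 100:500:100

# Save per-concurrency latency results to a CSV file for later analysis:
perf_analyzer -m my_model --concurrency-range 1:8:2 -f results.csv
```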