{"id":5364,"library":"perf-analyzer","title":"Triton Performance Analyzer","description":"Triton Performance Analyzer (perf_analyzer) is a command-line interface (CLI) tool for optimizing the inference performance of models running on the NVIDIA Triton Inference Server. It measures key metrics such as throughput and latency by sending inference requests to your model and repeating measurements until the values stabilize. The library is currently at version 2.59.1 and follows the release cadence of the broader Triton Inference Server project.","status":"active","version":"2.59.1","language":"en","source_language":"en","source_url":"https://github.com/triton-inference-server/perf_analyzer","tags":["performance","benchmarking","inference","nvidia","triton","cli","latency","throughput"],"install":[{"cmd":"pip install perf-analyzer","lang":"bash","label":"Pip installation"},{"cmd":"docker run --rm --gpus=all -it --net=host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk perf_analyzer -m <model>","lang":"bash","label":"Recommended Docker SDK Container"}],"dependencies":[],"imports":[],"quickstart":{"code":"# Assuming Triton Inference Server is running at localhost:8000 with a model named 'my_model'\n# First, ensure Triton is running. Example (simplified):\n# docker pull nvcr.io/nvidia/tritonserver:24.02-py3\n# docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:24.02-py3\n# (Inside container) tritonserver --model-repository /models &\n\n# Run perf_analyzer from a terminal where Triton is accessible\nperf_analyzer -m my_model --measurement-interval 5000 --concurrency-range 1:8:2","lang":"bash","description":"This quickstart runs `perf_analyzer` against a hypothetical model named 'my_model' on a running Triton Inference Server. It measures performance over 5-second measurement intervals (5000 ms) at each of the concurrency levels 1, 3, 5, and 7 produced by the range 1:8:2."},"warnings":[{"fix":"For generative AI model benchmarking, use AIPerf instead of genai-perf. Consult the NVIDIA Triton documentation for AIPerf migration guides.","message":"The related tool `genai-perf` is being deprecated. Users should migrate to `AIPerf` for continued support and enhanced features in generative AI model benchmarking.","severity":"deprecated","affected_versions":"All versions"},{"fix":"Manually install any reported missing runtime dependencies. The recommended installation method is the Triton SDK Docker container, which includes all necessary pre-built executables and dependencies.","message":"When installing `perf-analyzer` via `pip`, runtime dependencies (e.g., CUDA-related libraries for GPU support, `tritonclient` dependencies) are not automatically managed. Missing dependencies cause errors at execution time.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Avoid these specific options when using `--service-kind=triton_c_api`. Use the HTTP or gRPC endpoints for full feature support if these functionalities are critical.","message":"Direct C API mode within `perf_analyzer` has known limitations, including lack of support for asynchronous mode (`-a`), shared memory mode (`--shared-memory`), and request rate range mode.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For more stable and potentially lower-latency measurements, use the `--shared-memory=system` (CPU shared memory) or `--shared-memory=cuda` (GPU shared memory) option when benchmarking.","message":"Performance metrics, especially latency, can vary significantly between runs when shared memory is not used. Using `--shared-memory=system` or `--shared-memory=cuda` yields more stable and representative results by reducing data-transfer overhead.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Run a single `perf_analyzer` instance per Triton server, or drive concurrent load through one `perf_analyzer` process's built-in concurrency options. If multi-model analysis is needed, consider `Triton Model Analyzer`.","message":"Running multiple `perf_analyzer` processes concurrently against a single Triton Inference Server instance can lead to unexpected behavior or skewed measurements.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}