Triton Performance Analyzer

2.59.1 · active · verified Mon Apr 13

Triton Performance Analyzer (perf_analyzer) is a command-line interface (CLI) tool for optimizing the inference performance of models running on the NVIDIA Triton Inference Server. It measures key metrics such as throughput and latency by generating inference requests against your model and repeating the measurement until the reported values stabilize. The tool is currently at version 2.59.1 and follows the release cadence of the broader Triton Inference Server project.

Warnings

Install
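One common way to obtain `perf_analyzer` is through the Triton SDK container, which ships the binary prebuilt. A minimal sketch follows; the container tag `24.02-py3-sdk` is illustrative, so substitute the release that matches your Triton server version:

```shell
# perf_analyzer ships in the Triton SDK container (not the server container).
# The tag below is an example; pick the release matching your server.
docker pull nvcr.io/nvidia/tritonserver:24.02-py3-sdk
docker run --rm -it --net host nvcr.io/nvidia/tritonserver:24.02-py3-sdk

# Inside the container, the binary is already on PATH:
perf_analyzer --help
```

Running with `--net host` lets the containerized `perf_analyzer` reach a Triton server listening on localhost of the host machine.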

Quickstart

This quickstart demonstrates how to run `perf_analyzer` against a hypothetical model named `my_model` on a running Triton Inference Server. It measures performance over 5-second intervals while sweeping concurrency from 1 to 8 in steps of 2 (i.e., levels 1, 3, 5, and 7).

# Assuming Triton Inference Server is running at localhost:8000 with a model named 'my_model'
# First, ensure Triton is running with a model repository mounted. Example (simplified):
# docker pull nvcr.io/nvidia/tritonserver:24.02-py3
# docker run --gpus all --rm -it --net host -v /path/to/models:/models nvcr.io/nvidia/tritonserver:24.02-py3
# (Inside container) tritonserver --model-repository /models &

# Run perf_analyzer from a terminal where Triton is accessible
perf_analyzer -m my_model --measurement-interval 5000 --concurrency-range 1:8:2
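`perf_analyzer` can also export its per-concurrency results to a CSV file via the `-f` flag (e.g., `-f results.csv`), which makes it easy to post-process a sweep programmatically. Below is a minimal sketch that picks the concurrency level with the highest throughput; the sample data and the column names (`Concurrency`, `Inferences/Second`, `p95 latency`) are assumptions for illustration, so check the header your `perf_analyzer` version actually emits:

```python
import csv
import io

# Hypothetical sample of a perf_analyzer CSV export (column names are
# illustrative; verify against your version's actual output).
SAMPLE_CSV = """Concurrency,Inferences/Second,p95 latency
1,120.5,9500
3,310.2,11200
5,420.8,14100
7,455.1,17800
"""

def best_concurrency(csv_text):
    """Return (concurrency, throughput) for the row with the highest throughput."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda r: float(r["Inferences/Second"]))
    return int(best["Concurrency"]), float(best["Inferences/Second"])

conc, tput = best_concurrency(SAMPLE_CSV)
print(f"best concurrency={conc} at {tput} inferences/sec")
```

In practice you would point `best_concurrency` at the file written by `-f` and use the result to choose a serving configuration, keeping an eye on whether the latency at that concurrency is still acceptable.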
