{"library":"perf-analyzer","title":"Triton Performance Analyzer","description":"Triton Performance Analyzer (perf_analyzer) is a command-line interface (CLI) tool that helps you optimize the inference performance of models running on NVIDIA Triton Inference Server. It generates inference requests to your model, measures key metrics such as throughput and latency, and repeats the measurements until the values stabilize. The library is currently at version 2.59.1 and follows the release cadence of the broader Triton Inference Server project.","language":"python","status":"active","last_verified":"Mon Apr 13","install":{"commands":["pip install perf-analyzer"],"cli":{"name":"perf_analyzer","version":"Perf Analyzer Version 0.0.0 (commit 3a49a23)"}},"imports":[],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"# Assumes Triton Inference Server is reachable at localhost:8000 and serves a model named 'my_model'.\n# First, ensure Triton is running. Example (simplified; /path/to/model_repository is a placeholder for your own model repository):\n# docker pull nvcr.io/nvidia/tritonserver:24.02-py3\n# docker run --gpus all --rm -it --net host -v /path/to/model_repository:/models nvcr.io/nvidia/tritonserver:24.02-py3\n# (Inside container) tritonserver --model-repository /models &\n\n# Run perf_analyzer from a terminal that can reach Triton.\n# --measurement-interval is in milliseconds; --concurrency-range is start:end:step.\nperf_analyzer -m my_model --measurement-interval 5000 --concurrency-range 1:8:2","lang":"bash","description":"This quickstart demonstrates how to run `perf_analyzer` against a hypothetical model named 'my_model' on a running Triton Inference Server. It uses a 5-second (5000 ms) measurement window at each concurrency level from 1 to 8 in steps of 2 (1, 3, 5, 7).","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}