{"id":7255,"library":"genai-perf","title":"GenAI Perf Analyzer CLI","description":"GenAI-Perf is a command-line interface (CLI) tool for measuring the throughput and latency of generative AI models (Large Language Models, Vision Language Models, Embedding Models, Ranking Models, and LoRA Adapters) served through an inference server. It generates load, measures key performance metrics such as output token throughput, time to first token, inter-token latency, and request throughput, and reports the results to the console and to CSV and JSON files. Although currently at version 0.0.16 and under rapid development, it is being phased out in favor of NVIDIA's newer `AIPerf` tool for generative AI benchmarking.","status":"deprecated","version":"0.0.16","language":"en","source_language":"en","source_url":"https://github.com/triton-inference-server/perf_analyzer","tags":["AI","LLM","performance","profiling","benchmark","Triton","NVIDIA","CLI"],"install":[{"cmd":"pip install genai-perf","lang":"bash","label":"Recommended Python Installation"},{"cmd":"export RELEASE=\"YY.MM\" # e.g. export RELEASE=\"24.06\"\ndocker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk\ngenai-perf --help","lang":"bash","label":"Triton Server SDK Container (includes dependencies)"}],"dependencies":[{"reason":"Required runtime environment.","package":"python","optional":false},{"reason":"Required for interactions with Triton Inference Server. Automatically included with the Triton SDK container or installed separately.","package":"tritonclient","optional":true}],"imports":[{"note":"Used for programmatic access to `analyze` command results, not for direct profiling.","symbol":"Checkpoint","correct":"from genai_perf.checkpoint.checkpoint import Checkpoint"},{"note":"Used for programmatic access to `analyze` command results, not for direct profiling.","symbol":"Results","correct":"from genai_perf.config.run.results import Results"}],"quickstart":{"code":"# Note: A Triton Inference Server or OpenAI-compatible API endpoint with a model (e.g., GPT-2 TensorRT-LLM) must be running.\n# For example, to run GPT-2 on Triton, you might use 'triton import -m gpt2 --backend tensorrtllm' and then 'triton start'.\n\ngenai-perf profile \\\n    -m gpt2 \\\n    --backend tensorrtllm \\\n    --synthetic-input-tokens-mean 200 \\\n    --synthetic-input-tokens-stddev 0 \\\n    --output-tokens-mean 100 \\\n    --output-tokens-stddev 0 \\\n    --output-tokens-mean-deterministic \\\n    --streaming \\\n    --request-count 50 \\\n    --warmup-request-count 10\n\n# This command generates load against the 'gpt2' model served by Triton\n# (using the TensorRT-LLM backend), measuring performance metrics for 50 requests\n# with synthetic inputs and a deterministic output length, after a 10-request warmup.\n# Results are printed to the console and saved to files in the 'artifacts' directory.","lang":"bash","description":"This quickstart demonstrates how to run a performance profile against a hypothetical GPT-2 model served by Triton Inference Server using synthetic data. Ensure your inference server and model are already running before executing this command. The output will include metrics such as Time to First Token, Inter-Token Latency, and Request Latency."},"warnings":[{"fix":"For new projects, or for existing projects requiring continued support, evaluate and transition to `AIPerf`. Consult NVIDIA's documentation for migration guides.","message":"`genai-perf` is being phased out. NVIDIA recommends migrating to `AIPerf` for new generative AI performance benchmarking needs, as `genai-perf` will no longer receive active feature development.","severity":"deprecated","affected_versions":"All versions"},{"fix":"Ensure your inference server is configured and running, and your target model is loaded and ready to receive requests, before invoking `genai-perf` commands.","message":"GenAI-Perf is a CLI tool and requires an inference server (e.g., NVIDIA Triton Inference Server or an OpenAI-compatible API endpoint) with a model loaded to be running and accessible before `genai-perf` can perform benchmarking.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the latest official documentation or `genai-perf --help` for the most up-to-date command-line arguments. Pin exact versions in production environments to avoid unexpected changes.","message":"The `genai-perf` tool is in early release and under rapid development. Command-line options and functionality are subject to change between minor versions.","severity":"gotcha","affected_versions":"0.0.x"},{"fix":"Set the `HF_TOKEN` environment variable to a valid Hugging Face token (e.g., `export HF_TOKEN='hf_YOUR_TOKEN'`) before running `genai-perf` if using such models.","message":"When benchmarking models from gated repositories (e.g., some Hugging Face models such as Llama 3), an `HF_TOKEN` environment variable may be required for authentication to download the tokenizer.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"If installed via `pip install genai-perf`, ensure your Python virtual environment is activated. If using the Triton SDK container, run `genai-perf` from *inside* the container (`docker run -it ... genai-perf --help`).","cause":"The `genai-perf` executable is not on your system's PATH, or the package was not installed correctly in the active environment. This is common when the tool is installed inside a Docker container but invoked outside it, or when a Python virtual environment is not activated.","error":"genai-perf: command not found"},{"fix":"Increase the `--measurement-interval` (e.g., `-p 10000` for 10 seconds or higher) to allow sufficient time for requests to complete and data to be recorded. Ensure you are using a consistent and recommended `genai-perf` container version as per NVIDIA's documentation, or check the `--profile-export-file` for other output files.","cause":"This issue has been reported in forum discussions, potentially because the `--measurement-interval` is too short for the inference server to complete enough requests, or due to specific container versions.","error":"Missing 'input_output_genai_perf.csv' files in artifact directories after running 'genai-perf analyze'."},{"fix":"Verify that your Triton Inference Server (or other endpoint) is running and reachable at the `--url` provided. Confirm that the model specified by `-m` or `--model` is correctly loaded and that the `--backend` and `--endpoint-type` (e.g., `chat`, `completions`, `embeddings`) match the server's configuration and the model's capabilities.","cause":"The target inference server is either not running, inaccessible at the specified URL, or the model is not loaded correctly. Alternatively, there might be an issue with the `--model` name, `--backend`, or `--endpoint-type` configuration.","error":"Error during request generation or processing (e.g., connection refused, HTTP 4xx/5xx errors)."}]}