GenAI Perf Analyzer CLI
GenAI-Perf is a command-line interface (CLI) tool for measuring the throughput and latency of generative AI models (large language models, vision-language models, embedding models, ranking models, and LoRA adapters) served through an inference server. It generates load, measures key performance metrics such as output token throughput, time to first token, inter-token latency, and request throughput, and reports the results to the console and to CSV and JSON files. GenAI-Perf is currently at version 0.0.16 and under rapid development, but it is being actively phased out in favor of NVIDIA's newer `AIPerf` tool for generative AI benchmarking.
Common errors
- `genai-perf: command not found`
  - Cause: The `genai-perf` executable is not on your system's PATH, or the package was not installed correctly in the active environment. This is common when the tool is installed inside a Docker container but run outside it, or when a Python virtual environment is not activated.
  - Fix: If installed via `pip install genai-perf`, ensure your Python virtual environment is activated. If using the Triton SDK container, run `genai-perf` from *inside* the container (`docker run -it ... genai-perf --help`).
- Missing `input_output_genai_perf.csv` files in artifact directories after running `genai-perf analyze`
  - Cause: This issue has been reported in forum discussions; it may occur when the `--measurement-interval` is too short for the inference server to complete enough requests, or with specific container versions.
  - Fix: Increase the `--measurement-interval` (e.g., `-p 10000` for 10 seconds or higher) to allow sufficient time for requests to complete and data to be recorded. Use a consistent, recommended `genai-perf` container version per NVIDIA's documentation, or check the `--profile-export-file` for other output files.
- Errors during request generation or processing (e.g., connection refused, HTTP 4xx/5xx)
  - Cause: The target inference server is not running, is unreachable at the specified URL, or the model is not loaded correctly. Alternatively, the `--model` name, `--backend`, or `--endpoint-type` may be misconfigured.
  - Fix: Verify that your Triton Inference Server (or other endpoint) is running and reachable at the `--url` provided. Confirm the model specified by `-m`/`--model` is loaded, and that `--backend` and `--endpoint-type` (e.g., `chat`, `completions`, `embeddings`) match the server's configuration and the model's capabilities.
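To rule out connectivity problems before benchmarking, a quick readiness probe against the server can help. The sketch below assumes a Triton server exposing the standard KServe HTTP/REST health route (`/v2/health/ready`); the host and port are placeholders for whatever you pass to `--url`.

```python
# Minimal readiness probe for a Triton-style HTTP endpoint.
# Assumes the KServe-standard /v2/health/ready route; adjust for other servers.
from urllib import error, request


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness route with HTTP 200."""
    try:
        with request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: server is not reachable.
        return False


if __name__ == "__main__":
    # Placeholder URL; substitute the address you pass to --url.
    print(server_ready("http://localhost:8000"))
```

If this returns `False`, fix server startup and networking before re-running `genai-perf`.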
Warnings
- Deprecated: `genai-perf` is being phased out. NVIDIA recommends migrating to `AIPerf` for new generative AI performance benchmarking needs, as `genai-perf` will no longer receive active feature development.
- Gotcha: GenAI-Perf is a CLI tool and requires an inference server (e.g., NVIDIA Triton Inference Server or an OpenAI-compatible API endpoint) with a model loaded to be already running and accessible before `genai-perf` can perform benchmarking.
- Gotcha: The `genai-perf` tool is in early release and under rapid development. Command-line options and functionalities are subject to change between minor versions.
- Gotcha: When benchmarking models from gated repositories (e.g., some Hugging Face models like Llama 3), an `HF_TOKEN` environment variable may be required for authentication to download the tokenizer.
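For gated checkpoints, the token can be exported before invoking the profiler. This is a hypothetical setup: the token value is a placeholder and the model name is only illustrative.

```shell
# Hypothetical setup for a gated tokenizer (e.g., Llama 3 on Hugging Face).
# Replace the placeholder with a read-scoped Hugging Face access token.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"

# Subsequent genai-perf runs in this shell can then authenticate when
# downloading the gated tokenizer, e.g. (illustrative model name):
# genai-perf profile -m meta-llama/Meta-Llama-3-8B-Instruct ...
```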
Install
- Via pip:
  pip install genai-perf
- Via the Triton SDK container:
  export RELEASE="YY.MM"  # e.g. export RELEASE="24.06"
  docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk genai-perf --help
Imports
- Checkpoint
from genai_perf.checkpoint.checkpoint import Checkpoint
- Results
from genai_perf.config.run.results import Results
Quickstart
# Note: A Triton Inference Server or OpenAI-compatible API endpoint with a model (e.g., GPT-2 TensorRT-LLM) must be running.
# For example, to run GPT-2 on Triton, you might use 'triton import -m gpt2 --backend tensorrtllm' and then 'triton start'.
genai-perf profile \
-m gpt2 \
--backend tensorrtllm \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--streaming \
--request-count 50 \
--warmup-request-count 10
# This command will generate a load against the 'gpt2' model served by Triton
# (using TensorRT-LLM backend), measuring performance metrics for 50 requests
# with synthetic inputs and a deterministic output length, after a 10-request warmup.
# Results are printed to console and saved to files in the 'artifacts' directory.
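The exported files can also be post-processed programmatically. The sketch below is a hedged example: the artifact file name and the JSON field names (`time_to_first_token`, `avg`, `p99`, etc.) are assumptions about the export schema and may differ between versions, so it writes a tiny stand-in file first to stay self-contained.

```python
# Sketch: reading latency statistics from a genai-perf JSON export.
# The file name and schema here are assumptions (they vary by version);
# a stand-in file is created so the example runs on its own.
import json
from pathlib import Path

sample = {
    "time_to_first_token": {"unit": "ms", "avg": 42.1, "p99": 55.3},
    "inter_token_latency": {"unit": "ms", "avg": 11.7, "p99": 14.2},
}
path = Path("profile_export_genai_perf.json")  # hypothetical artifact name
path.write_text(json.dumps(sample))

# In a real workflow, point `path` at the file in your 'artifacts' directory.
metrics = json.loads(path.read_text())
for name, stats in metrics.items():
    print(f"{name}: avg={stats['avg']} {stats['unit']} (p99={stats['p99']})")
```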