GenAI Perf Analyzer CLI

0.0.16 · deprecated · verified Thu Apr 16

GenAI-Perf is a command-line interface (CLI) tool for measuring the throughput and latency of generative AI models (Large Language Models, Vision Language Models, Embedding Models, Ranking Models, and LoRA Adapters) served through an inference server. It generates load, measures key performance metrics such as output token throughput, time to first token, inter-token latency, and request throughput, and reports results to the console and to CSV and JSON files. Although still under active development at version 0.0.16, it is being phased out in favor of NVIDIA's newer `AIPerf` tool for generative AI benchmarking.
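Because results are exported to JSON, they can be post-processed programmatically. The sketch below is illustrative only: the exact schema of the exported file varies across GenAI-Perf versions, and the file name and metric keys used here (`time_to_first_token`, `inter_token_latency`) are assumptions modeled on the console metric names, not a documented schema.

```python
import json

# Illustrative sample mimicking a GenAI-Perf JSON export; real field
# names and nesting depend on the genai-perf version in use.
sample = {
    "time_to_first_token": {"unit": "ms", "avg": 42.1, "p99": 80.5},
    "inter_token_latency": {"unit": "ms", "avg": 11.3, "p99": 19.8},
}
with open("profile_export_genai_perf.json", "w") as f:
    json.dump(sample, f)

# Load the export and print a one-line summary per metric.
with open("profile_export_genai_perf.json") as f:
    metrics = json.load(f)

for name, stats in metrics.items():
    print(f"{name}: avg={stats['avg']} {stats['unit']}, p99={stats['p99']} {stats['unit']}")
```

Adapt the keys to whatever your version of GenAI-Perf actually writes; inspecting the file once by hand is the quickest way to confirm the schema.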

Install
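GenAI-Perf is distributed on PyPI; a typical installation looks like the following (assuming a working Python environment):

```shell
# Install the GenAI-Perf CLI from PyPI.
pip install genai-perf

# Confirm the CLI is on PATH and report its version.
genai-perf --version
```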

Quickstart

This quickstart demonstrates how to run a performance profile against a hypothetical GPT-2 model served by Triton Inference Server using synthetic data. Ensure your inference server and model are already running before executing this command. The output will include metrics like Time to First Token, Inter-Token Latency, and Request Latency.

# Note: A Triton Inference Server or OpenAI-compatible API endpoint with a model (e.g., GPT-2 TensorRT-LLM) must be running.
# For example, to run GPT-2 on Triton, you might use 'triton import -m gpt2 --backend tensorrtllm' and then 'triton start'.

genai-perf profile \
    -m gpt2 \
    --backend tensorrtllm \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --output-tokens-mean-deterministic \
    --streaming \
    --request-count 50 \
    --warmup-request-count 10

# This command will generate a load against the 'gpt2' model served by Triton
# (using TensorRT-LLM backend), measuring performance metrics for 50 requests
# with synthetic inputs and a deterministic output length, after a 10-request warmup.
# Results are printed to console and saved to files in the 'artifacts' directory.
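As a sanity check on the reported numbers, output token throughput can be estimated from the request count and output length used above. This is a back-of-the-envelope calculation, not GenAI-Perf's exact methodology, and the benchmark duration below is an assumed value you would replace with your run's actual wall-clock time:

```python
# Parameters from the quickstart command above.
requests = 50                    # --request-count 50
output_tokens_per_request = 100  # --output-tokens-mean 100, stddev 0

# Hypothetical total benchmark duration; substitute your measured value.
benchmark_seconds = 20.0

total_output_tokens = requests * output_tokens_per_request
throughput = total_output_tokens / benchmark_seconds
print(f"~{throughput:.1f} output tokens/sec")  # ~250.0 output tokens/sec
```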
