AIPerf
AIPerf is a comprehensive benchmarking tool designed to measure the performance of generative AI models served by various inference solutions. It provides detailed metrics and extensive benchmark performance reports through a command-line interface. The library is actively maintained, with regular releases (4-12 per year), and the current version is 0.7.0.
Common errors
- Connection refused / port exhaustion errors at high concurrency
  Cause: the system has run out of ephemeral ports for new outgoing connections, often due to extremely high concurrency settings (e.g., >15,000 requests).
  Fix: reduce the `--concurrency` value in your AIPerf command. For persistent high-concurrency needs, consult your operating system's documentation on increasing ephemeral port limits.
- AIPerf hangs indefinitely during startup/initialization
  Cause: invalid configuration settings or command-line arguments prevent AIPerf from initializing correctly, leading to a deadlock or infinite wait state.
  Fix: terminate the hung process (e.g., Ctrl+C) and review all command-line arguments and configuration files for syntax errors, missing values, or logical inconsistencies.
- Tokens-per-second or output sequence length metrics appear incorrect when server tokenization differs from the local tokenizer
  Cause: by default, AIPerf uses its local tokenizer to calculate token-based metrics. If the inference server uses a different tokenizer or tokenization strategy, these client-side counts can be inaccurate.
  Fix: enable server token count support if your inference server provides it. AIPerf then uses the server's own token counts for throughput and output length metrics, which also reduces client-side overhead.
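The tokenizer-mismatch issue above can be seen with a toy example. This sketch is not AIPerf code: the response shape follows the OpenAI-compatible chat completions schema (which servers such as Ollama expose), and the whitespace split is a deliberately naive stand-in for a local tokenizer that disagrees with the server's.

```python
# Hypothetical OpenAI-compatible response; the `usage` block carries the
# server's own token counts.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hello there, how can I help?"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
}

text = response["choices"][0]["message"]["content"]

# Naive client-side estimate (stand-in for a mismatched local tokenizer).
local_estimate = len(text.split())

# Authoritative count reported by the server.
server_count = response["usage"]["completion_tokens"]

print(local_estimate, server_count)  # 6 8
```

Any tokens-per-second figure derived from `local_estimate` here would be off by 25% relative to the server's own accounting, which is why server token count support matters.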
Warnings
- gotcha Output sequence length constraints (`--output-tokens-mean`) may not be guaranteed by the inference server unless `ignore_eos` and/or `min_tokens` are explicitly passed via `--extra-inputs` to a supporting server.
- breaking Latency percentile metrics (P50, P90, P99) currently consider only successful requests. This can make performance reports misleading, especially in high error-rate scenarios, since failed requests are not factored into the percentiles.
- gotcha Very high concurrency settings (typically >15,000) can lead to ephemeral port exhaustion on some systems, resulting in connection failures.
- gotcha Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely during initialization.
- breaking When migrating from GenAI-Perf, the `payload` field in `inputs.json` has been renamed to `payloads` (plural) to better support multi-turn conversations. Additionally, a new `session_id` field has been added.
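The success-only percentile warning above can be illustrated with a small sketch. This is not AIPerf's implementation: the latency values and the nearest-rank percentile helper are hypothetical, chosen to show how dropping failures hides tail latency.

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of `values`."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

successes = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]  # seconds
failures = [30.0] * 9  # requests that timed out at 30 s (50% error rate)

p99_success_only = percentile(successes, 99)    # what a success-only report shows
p99_all = percentile(successes + failures, 99)  # what clients actually experienced

print(p99_success_only)  # 1.6
print(p99_all)           # 30.0
```

At a 50% error rate the success-only P99 looks healthy at 1.6 s while the all-requests P99 is 30 s, so always read percentile metrics alongside the error rate.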
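A minimal sketch of the `payload` to `payloads` migration described above, assuming an illustrative record shape; the actual `inputs.json` schema may carry additional fields, and the `session_id` value here is made up.

```python
import json

# Old GenAI-Perf style record (illustrative field contents).
old_record = {"payload": {"text": "What is the capital of France?"}}

# New AIPerf style: `payload` becomes the list `payloads` (to support
# multi-turn conversations) and records gain a `session_id`.
new_record = {
    "session_id": "session-0",            # new field for grouping turns
    "payloads": [old_record["payload"]],  # renamed and pluralized
}

print(json.dumps(new_record, indent=2))
```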
Install
- pip install aiperf
Imports
- MetricRecordInfo
from aiperf.common.models import MetricRecordInfo
- Synthesizer
from aiperf.dataset.synthesis import Synthesizer
- random_generator as rng
import random
from aiperf.common import random_generator as rng
Quickstart
python3 -m venv venv
source venv/bin/activate
pip install aiperf

# Assuming an Ollama server is running locally with a model like 'granite4:350m', e.g. via:
#   docker run -d --name ollama -p 11434:11434 -v ollama-data:/root/.ollama ollama/ollama:latest
#   docker exec -it ollama ollama pull granite4:350m

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434 \
  --concurrency 5 \
  --request-count 10