{"id":8817,"library":"aiperf","title":"AIPerf","description":"AIPerf is a comprehensive benchmarking tool for measuring the performance of generative AI models served by various inference solutions. It provides detailed metrics and extensive benchmark reports through a command-line interface. The library is actively maintained, with regular releases (4-12 per year); the current version is 0.7.0.","status":"active","version":"0.7.0","language":"en","source_language":"en","source_url":"https://github.com/ai-dynamo/aiperf","tags":["AI","benchmarking","performance testing","LLM","generative AI","inference"],"install":[{"cmd":"pip install aiperf","lang":"bash","label":"Install AIPerf"}],"dependencies":[{"reason":"Requires Python version 3.10 or newer.","package":"python","optional":false}],"imports":[{"note":"Used for parsing per-request metric records from benchmark output files (profile_export.jsonl).","symbol":"MetricRecordInfo","correct":"from aiperf.common.models import MetricRecordInfo"},{"note":"Used for programmatic dataset synthesis, typically within custom plugin development.","symbol":"Synthesizer","correct":"from aiperf.dataset.synthesis import Synthesizer"},{"note":"For reproducible data generation within plugins, always use `aiperf.common.random_generator.derive()` instead of Python's built-in `random` module; `random`'s global state is fragile and can compromise reproducibility.","wrong":"import random","symbol":"random_generator as rng","correct":"from aiperf.common import random_generator as rng"}],"quickstart":{"code":"python3 -m venv venv\nsource venv/bin/activate\npip install aiperf\n\n# Assuming an Ollama server is running locally with a model like 'granite4:350m'\n# (e.g., via docker run -d --name ollama -p 11434:11434 -v ollama-data:/root/.ollama ollama/ollama:latest && docker exec -it ollama ollama pull granite4:350m)\n\naiperf profile \\\n  --model \"granite4:350m\" \\\n  --streaming \\\n  --endpoint-type chat \\\n  --tokenizer ibm-granite/granite-4.0-micro \\\n  --url http://localhost:11434 \\\n  --concurrency 5 \\\n  --request-count 10","lang":"bash","description":"This quickstart runs a basic performance benchmark against a locally running Ollama server. It profiles the specified model through the chat endpoint with streaming enabled, using a specific tokenizer and fixed concurrency and request-count settings."},"warnings":[{"fix":"Consult your inference server's documentation and use `--extra-inputs` to pass `ignore_eos` and/or `min_tokens` if precise output length control is critical for your benchmark.","message":"Output sequence length constraints (`--output-tokens-mean`) may not be honored by the inference server unless `ignore_eos` and/or `min_tokens` are explicitly passed via `--extra-inputs` to a server that supports them.","severity":"gotcha","affected_versions":">=0.1.0"},{"fix":"Remember that reported latency percentiles reflect only successful requests. For a complete picture, also examine the error rates and the raw request data (`profile_export.jsonl`) to account for failed requests manually.","message":"Latency percentile metrics (P50, P90, P99) consider only successful requests. This can produce misleading performance reports, especially in high error-rate scenarios, because failed requests are not factored into the percentiles.","severity":"breaking","affected_versions":">=0.1.0"},{"fix":"If you encounter connection failures at high concurrency, reduce the `--concurrency` value. You may also need to raise the system's ephemeral port limits (e.g., on Linux, by adjusting `/proc/sys/net/ipv4/ip_local_port_range`).","message":"Very high concurrency settings (typically >15,000) can exhaust ephemeral ports on some systems, resulting in connection failures.","severity":"gotcha","affected_versions":">=0.1.0"},{"fix":"If AIPerf appears to freeze during startup, terminate the process (Ctrl+C) and check all configuration settings and command-line arguments for syntax or logical errors.","message":"Invalid configuration settings can cause AIPerf to hang indefinitely during initialization.","severity":"gotcha","affected_versions":">=0.1.0"},{"fix":"Update your `inputs.json` files to use the `payloads` array and include a `session_id` for each entry. See the `Migrating from GenAI-Perf` documentation for a detailed comparison.","message":"When migrating from GenAI-Perf, the `payload` field in `inputs.json` has been renamed to `payloads` (plural) to better support multi-turn conversations, and a new `session_id` field has been added.","severity":"breaking","affected_versions":"0.1.0 - 0.7.0 (migration from GenAI-Perf)"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Reduce the `--concurrency` parameter in your AIPerf command. For persistent high-concurrency needs, consult your operating system's documentation on raising ephemeral port limits.","cause":"The system has run out of ephemeral ports for new outgoing connections, typically due to extremely high concurrency settings (e.g., concurrency >15,000).","error":"Connection refused / port exhaustion errors at high concurrency"},{"fix":"Terminate the hung process (e.g., Ctrl+C) and review all command-line arguments and configuration files for syntax errors, missing values, or logical inconsistencies.","cause":"Invalid configuration settings or command-line arguments prevent AIPerf from initializing, leading to a deadlock or indefinite wait.","error":"AIPerf hangs indefinitely during startup/initialization"},{"fix":"Enable server token count support if your inference server provides it, so that AIPerf uses the server's own token counts for throughput and output length metrics and reduces client-side overhead.","cause":"By default, AIPerf may use its local tokenizer to compute token-based metrics. If the inference server uses a different tokenizer or tokenization strategy, these client-side calculations can be inaccurate.","error":"Tokens-per-second or output sequence length metrics appear incorrect when server tokenization differs from the local tokenizer."}]}