ONNX Runtime GenAI
ONNX Runtime GenAI is a Python library that provides an easy, flexible, and performant way to run generative AI models (large language models and multi-modal models) on-device and in the cloud using ONNX Runtime. It encapsulates the complete generative AI loop: pre- and post-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. The library is actively developed, generally following a quarterly release cadence in line with the broader ONNX Runtime project.
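The stages of that loop can be sketched in plain Python. This is purely illustrative: the `dummy_logits` function below is a stand-in for what, in the real library, is an ONNX Runtime forward pass with KV-cache reuse.

```python
# Illustrative sketch of the generative-AI loop the library encapsulates.
import random

def dummy_logits(tokens):
    # Stand-in for an ONNX Runtime forward pass (with KV-cache reuse).
    random.seed(sum(tokens))
    return [random.random() for _ in range(100)]  # pretend vocab of 100 ids

def generate(prompt_tokens, max_length=10, eos_id=0):
    tokens = list(prompt_tokens)          # pre-processing (tokenization) already done
    while len(tokens) < max_length:
        logits = dummy_logits(tokens)     # inference
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy "search"
        tokens.append(next_id)            # post-processing would detokenize these
        if next_id == eos_id:
            break
    return tokens

print(generate([5, 7], max_length=6))
```

The real library replaces each of these stages with optimized native implementations, but the control flow is the same shape.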
Common errors
- `ModuleNotFoundError: No module named 'onnxruntime_genai'`
  - Cause: The package `onnxruntime-genai` is not installed in the active Python environment, or the environment is not correctly activated (e.g., a Jupyter notebook using the wrong kernel).
  - Fix: Ensure `pip install onnxruntime-genai` ran successfully in the correct virtual environment, or switch to the appropriate Python kernel in your IDE/notebook.
- `ImportError: DLL load failed while importing onnxruntime_genai: A dynamic link library (DLL) initialization routine failed.`
  - Cause: Usually occurs in a Conda environment on Windows due to an outdated Visual Studio C++ runtime.
  - Fix: In your Conda environment, run `conda install conda-forge::vs2015_runtime`.
- `DLL load failed while importing onnxruntime_genai`
  - Cause: On Windows with CUDA, this often means the `CUDA_PATH` environment variable was not set correctly after CUDA Toolkit installation.
  - Fix: Set the `CUDA_PATH` system environment variable to the installation directory of your CUDA Toolkit.
- `ERROR: No matching distribution found for onnxruntime-genai`
  - Cause: The current Python version (e.g., Python 3.13) has no pre-built wheels for `onnxruntime-genai` on PyPI.
  - Fix: Use Python 3.10, 3.11, or 3.12, which have pre-built distributions.
- `RuntimeError: [json.exception.type_error.302] type must be string, but is array.`
  - Cause: Incompatibility between `onnxruntime-genai` (versions <= 0.4.0) and Hugging Face `transformers` (versions >= 4.45.0) when `tokenizer_config.json` uses an array for `model_input_names`.
  - Fix: Upgrade `onnxruntime-genai` to version 0.5.0 or newer, or downgrade `transformers` to a version earlier than 4.45.0.
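Since several of these errors come down to version mismatches, a quick first diagnostic is to print the installed versions of the relevant packages. A minimal sketch using the standard library (the package names listed are the PyPI distribution names mentioned in this document; trim the list to match your setup):

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("onnxruntime-genai", "onnxruntime-genai-cuda",
            "onnxruntime-genai-directml", "transformers"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```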
Warnings
- breaking: Version 0.6.0 introduced a breaking change to 'chat mode' (continuation/continuous decoding): the `GeneratorParams.input_ids` attribute and the `generator.compute_logits()` method were replaced or made redundant; prompt tokens are now appended directly with `generator.append_tokens()`.
- gotcha: ONNX Runtime GenAI versions 0.4.0 and earlier are incompatible with `transformers` 4.45.0 and later when using the Model Builder tool: if `tokenizer_config.json` contains an array for the `model_input_names` field, this leads to `RuntimeError: [json.exception.type_error.302]`.
- gotcha: Pre-built wheels for `onnxruntime-genai` currently do not support Python 3.13; attempting to install results in `ERROR: No matching distribution found`.
- gotcha: Examples on the `main` branch of the GitHub repository may not be compatible with the latest stable PyPI release binaries due to ongoing development.
- breaking: Models from earlier Ryzen AI releases are not compatible with Ryzen AI 1.7 (which upgrades OGA from v0.9.2.2 to v0.11.2).
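As a hedged sketch of the 0.6.0 migration described above (the pre-0.6.0 calls are reconstructed from the warning; verify against the release notes for your exact version):

```python
def generate_pre_0_6(model_path, prompt):
    # Pre-0.6.0 style (NO LONGER WORKS on 0.6.0+): tokens set via
    # params.input_ids, plus an explicit compute_logits() call each step.
    import onnxruntime_genai as og
    model = og.Model(model_path)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.input_ids = tokenizer.encode(prompt)   # removed in 0.6.0
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()                # removed in 0.6.0
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))

def generate_0_6_plus(model_path, prompt):
    # 0.6.0+ style: append prompt tokens to the generator; logits are
    # computed implicitly by generate_next_token().
    import onnxruntime_genai as og
    model = og.Model(model_path)
    tokenizer = og.Tokenizer(model)
    generator = og.Generator(model, og.GeneratorParams(model))
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))
```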
Install
- CPU: `pip install onnxruntime-genai`
- DirectML (Windows GPU): `pip install onnxruntime-genai-directml`
- CUDA (NVIDIA GPU): `pip install onnxruntime-genai-cuda`
Imports
- `onnxruntime_genai`
import onnxruntime_genai as og
Quickstart
import os
import onnxruntime_genai as og

# --- Prerequisite: download a model ---
# The following shell command downloads the Phi-3 Mini 4K Instruct ONNX model
# (CPU, INT4-quantized). It requires huggingface_hub: pip install huggingface_hub
# Run it in your terminal before executing this script:
#   huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
#     --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
#     --local-dir ./phi-3-mini-onnx

model_path = os.environ.get('ONNX_MODEL_PATH', './phi-3-mini-onnx')

try:
    # 1. Load the model
    model = og.Model(model_path)
    print(f"Loaded {model.type} on {model.device_type}")

    # 2. Create a tokenizer, plus a stream for incremental decoding
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()

    # 3. Create generator parameters; do_sample=True is required for
    #    top_p/temperature to take effect (otherwise search is greedy)
    params = og.GeneratorParams(model)
    params.set_search_options(do_sample=True, max_length=200, top_p=0.9, temperature=0.7)

    # 4. Encode the initial prompt
    prompt = "The capital of France is"
    input_tokens = tokenizer.encode(prompt)

    # 5. Create a generator instance and append the prompt tokens
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)

    print(f"Prompt: {prompt}")
    print("Generated text: ", end="")

    # 6. Generate tokens one by one, decoding incrementally for streaming
    #    output (the stream handles tokens that split multi-byte characters)
    while not generator.is_done():
        generator.generate_next_token()
        last_token = generator.get_sequence(0)[-1]
        print(tokenizer_stream.decode(last_token), end="", flush=True)
    print()

    # Full decoded sequence (optional, for non-streaming output):
    # output = tokenizer.decode(generator.get_sequence(0))
    # print(f"\nFull output: {output}")
except Exception as e:
    print(f"An error occurred: {e}")
    print(f"Ensure the model is downloaded to '{model_path}' and all dependencies are installed.")
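The quickstart feeds a raw completion prompt; instruction-tuned models like Phi-3 Mini generally respond better when the prompt follows their chat template. A small sketch of Phi-3's template (the special-token spellings below are taken from the model card and should be double-checked for your model; `phi3_chat_prompt` is a hypothetical helper, not part of the library):

```python
def phi3_chat_prompt(user_message, system_message=None):
    """Wrap a message in the Phi-3 chat template before tokenizing."""
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    parts.append(f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n")
    return "".join(parts)

prompt = phi3_chat_prompt("What is the capital of France?")
print(prompt)
# input_tokens = tokenizer.encode(prompt)  # then proceed as in the quickstart
```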