ONNX Runtime GenAI

0.13.1 · active · verified Thu Apr 16

ONNX Runtime GenAI is a Python library that provides an easy, flexible, and performant way to run generative AI models (large language models and multi-modal models) on-device and in the cloud using ONNX Runtime. It encapsulates the complete generative AI loop: pre- and post-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. The library is actively developed, with version 0.13.1 released in April 2026, and generally follows a quarterly release cadence in line with the broader ONNX Runtime project.

Common errors

Warnings

Install
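The core package installs from PyPI. Hardware-specific builds ship as separate packages; the variant names below match the project's published wheels, but check the release notes for your platform before relying on them:

```shell
# CPU (default) build
pip install onnxruntime-genai

# Alternative hardware-accelerated builds -- install ONE of these instead:
# pip install onnxruntime-genai-cuda        # NVIDIA GPUs (CUDA)
# pip install onnxruntime-genai-directml    # Windows GPUs (DirectML)
```

Only one variant should be installed in a given environment, since they all provide the same `onnxruntime_genai` module.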

Imports

Quickstart

This quickstart demonstrates how to load a pre-optimized ONNX model (such as Phi-3 Mini), tokenize an input prompt, and generate text with the `onnxruntime-genai` library. Before running the Python code, you must download an ONNX model into a local directory, typically with `huggingface-cli`. The example reads the model path from an environment variable for flexibility.

import os
import onnxruntime_genai as og

# --- Prerequisite: Download a model ---
# The following shell command downloads the Phi-3 Mini 4K Instruct ONNX model (CPU-INT4 quantized).
# You will need to install huggingface_hub: pip install huggingface_hub
# Run this command in your terminal before executing the Python code:
# huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
#   --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
#   --local-dir ./phi-3-mini-onnx

model_path = os.environ.get('ONNX_MODEL_PATH', './phi-3-mini-onnx')

try:
    # 1. Load the model
    model = og.Model(model_path)
    print(f"Loaded {model.type} on {model.device_type}")

    # 2. Create a tokenizer
    tokenizer = og.Tokenizer(model)

    # 3. Create generator parameters
    params = og.GeneratorParams(model)
    # do_sample=True is required for top_p/temperature to take effect;
    # without it, generation falls back to greedy search
    params.set_search_options(max_length=200, do_sample=True, top_p=0.9, temperature=0.7)

    # 4. Encode initial prompt and append to generator
    prompt = "The capital of France is"
    input_tokens = tokenizer.encode(prompt)

    # 5. Create a generator instance
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)

    print(f"Prompt: {prompt}")
    print("Generated text:", end="")

    # 6. Generate tokens one by one and stream the output. A TokenizerStream
    #    buffers partial tokens, avoiding the broken characters that a
    #    per-token tokenizer.decode() can produce for multi-byte sequences.
    tokenizer_stream = tokenizer.create_stream()
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end="", flush=True)
    print()

    # Get the full decoded sequence (optional, for non-streaming output)
    # output = tokenizer.decode(generator.get_sequence(0))
    # print(f"\nFull output: {output}")

except Exception as e:
    print(f"An error occurred: {e}")
    print(f"Please ensure the model is downloaded to '{model_path}' and all dependencies are installed.")
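The bare completion prompt above works, but instruct-tuned models such as Phi-3 Mini respond far better when the prompt is wrapped in the model's chat template. A minimal sketch, using the Phi-3 template from the model card (other models use different templates, so check the model's documentation; `phi3_prompt` is an illustrative helper, not part of the library):

```python
# Wrap a user message in Phi-3 Mini's instruct chat template.
# Template taken from the Phi-3 model card; other models differ.
def phi3_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message} <|end|>\n<|assistant|>"

prompt = phi3_prompt("What is the capital of France?")
# input_tokens = tokenizer.encode(prompt)  # then proceed as in the quickstart
```

Recent releases also expose a chat-template helper on the tokenizer itself; consult the API reference for your installed version before depending on it.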
