{"id":6757,"library":"optimum-onnx","title":"Optimum ONNX","description":"Optimum ONNX is a specialized extension of the Hugging Face Optimum library, providing a streamlined interface for exporting Hugging Face Transformer models (and other architectures such as Diffusers, Timm, and Sentence Transformers) to the ONNX format. It facilitates efficient inference and deployment using ONNX Runtime, including features like graph optimization and quantization. Currently at version 0.1.0, it sees regular updates to support new Hugging Face models and to ensure compatibility with underlying libraries like PyTorch and Transformers.","status":"active","version":"0.1.0","language":"en","source_language":"en","source_url":"https://github.com/huggingface/optimum-onnx","tags":["huggingface","onnx","onnxruntime","transformers","machine-learning","inference","deep-learning","optimization"],"install":[{"cmd":"pip install optimum-onnx","lang":"bash","label":"Base installation"},{"cmd":"pip install \"optimum-onnx[onnxruntime]\"","lang":"bash","label":"With ONNX Runtime (CPU)"},{"cmd":"pip install \"optimum-onnx[onnxruntime-gpu]\"","lang":"bash","label":"With ONNX Runtime (GPU)"}],"dependencies":[{"reason":"Core optimization library that optimum-onnx extends.","package":"optimum"},{"reason":"Required for Hugging Face model integration and tokenizer utilities. Explicitly pinned versions are often required (e.g., >=4.36,<4.58.0).","package":"transformers"},{"reason":"ONNX format definition and tools.","package":"onnx"},{"reason":"Inference engine for ONNX models.","package":"onnxruntime","optional":true}],"imports":[{"note":"Common class for loading and exporting sequence classification models to ONNX. Other ORTModelForXxx classes exist for different tasks.","symbol":"ORTModelForSequenceClassification","correct":"from optimum.onnxruntime import ORTModelForSequenceClassification"},{"note":"When running inference with an ONNX model, `optimum.onnxruntime.pipeline` should be used instead of `transformers.pipeline` for accelerated execution.","wrong":"from transformers import pipeline","symbol":"pipeline","correct":"from optimum.onnxruntime import pipeline"},{"note":"Used for loading tokenizers compatible with Hugging Face models.","symbol":"AutoTokenizer","correct":"from transformers import AutoTokenizer"},{"note":"Used to define quantization strategies for ONNX Runtime.","symbol":"AutoQuantizationConfig","correct":"from optimum.onnxruntime.configuration import AutoQuantizationConfig"}],"quickstart":{"code":"import os\nfrom optimum.onnxruntime import ORTModelForSequenceClassification\nfrom optimum.onnxruntime import pipeline as ORTPipeline  # Alias to avoid conflict with transformers.pipeline if imported\nfrom transformers import AutoTokenizer\n\nmodel_checkpoint = \"distilbert-base-uncased-finetuned-sst-2-english\"\nsave_directory = \"./tmp/onnx_model\"\n\n# 1. Load a model from transformers and export it to ONNX\nprint(f\"Exporting model {model_checkpoint} to ONNX...\")\nort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)\ntokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n\n# 2. Save the ONNX model and tokenizer\nos.makedirs(save_directory, exist_ok=True)\nort_model.save_pretrained(save_directory)\ntokenizer.save_pretrained(save_directory)\nprint(f\"Model and tokenizer saved to {save_directory}\")\n\n# 3. Load the exported ONNX model for inference\nprint(f\"Loading ONNX model from {save_directory} for inference...\")\nloaded_ort_model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name=\"model.onnx\")\nloaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)\n\n# 4. Run inference using the Optimum ONNX Runtime pipeline\ncls_pipeline = ORTPipeline(\"text-classification\", model=loaded_ort_model, tokenizer=loaded_tokenizer)\nresults = cls_pipeline(\"I love using Hugging Face Optimum ONNX!\")\nprint(f\"Inference result: {results}\")\n\n# Example with a quantized model (if applicable)\n# from optimum.onnxruntime.configuration import AutoQuantizationConfig\n# from optimum.onnxruntime import ORTQuantizer\n# qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)\n# quantizer = ORTQuantizer.from_pretrained(ort_model)\n# quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)\n# loaded_quantized_model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name=\"model_quantized.onnx\")\n# cls_pipeline_quant = ORTPipeline(\"text-classification\", model=loaded_quantized_model, tokenizer=loaded_tokenizer)\n# results_quant = cls_pipeline_quant(\"I love using Hugging Face Optimum ONNX with quantization!\")\n# print(f\"Quantized inference result: {results_quant}\")\n","lang":"python","description":"This quickstart demonstrates the core workflow: exporting a Hugging Face model to ONNX with `ORTModelForSequenceClassification.from_pretrained(export=True)`, saving the exported model and tokenizer, then loading the ONNX model and performing inference with the `optimum.onnxruntime.pipeline`."},"warnings":[{"fix":"Run `pip uninstall onnxruntime` before `pip install \"optimum-onnx[onnxruntime-gpu]\"` (or vice versa).","message":"Installing both `onnxruntime` and `onnxruntime-gpu` causes package conflicts. If you have installed one, `pip uninstall` it before installing the other.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Keep `optimum-onnx` updated. Check release notes for `torch` and `transformers` compatibility.","message":"Compatibility with `torch.onnx.export` and specific PyTorch versions can be challenging. Patch releases often address these issues, so keep `optimum-onnx` up to date, especially when working with newer PyTorch or Transformers versions.","severity":"gotcha","affected_versions":"<=v0.0.3 (historically, may recur)"},{"fix":"Verify hardware support for the desired optimization/quantization strategies. Re-optimize or re-quantize if deploying to different hardware.","message":"Optimization and quantization techniques applied with Optimum ONNX are often hardware-specific. For instance, `int8` quantization might only be supported on CPUs, and switching hardware after optimization can lead to issues.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always specify `file_name=\"model.onnx\"` or `file_name=\"model_quantized.onnx\"` when loading a saved ONNX model using `ORTModelForXxx.from_pretrained(save_directory, file_name=...)`.","message":"When loading a model for inference after export, ensure you load the ONNX model file (e.g., `model.onnx` or `model_quantized.onnx`) by specifying `file_name` in `ORTModelForXxx.from_pretrained`.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}