Qwen Omni Language Model Utilities
Qwen Omni Language Model Utils is a Python library providing a toolkit for conveniently handling audio, image, and video inputs for Qwen Omni multimodal models. It simplifies processing of base64 data, URLs, and interleaved audio, images, and videos, offering an API-like experience. The library is at version 0.0.9 and is actively maintained by the Qwen team as part of their multimodal large language model ecosystem.
Common errors
- KeyError: 'qwen2_5_omni'
  Cause: The `transformers` library does not include the Qwen Omni model configurations, usually because the installed `transformers` version is outdated or incompatible with the Qwen Omni model being loaded.
  Fix: Use the `transformers` version explicitly recommended in the Qwen model's documentation or Hugging Face model card. This may mean uninstalling your current `transformers` and installing a specific version or Git branch, e.g., `pip uninstall transformers && pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview`.
- Video processing is slow or fails
  Cause: Often `decord` is not installed or `ffmpeg` is not available on the system. Without `decord`, `qwen-omni-utils` falls back to `torchvision`, which can be slower for video.
  Fix: First, ensure `ffmpeg` is installed on your system (`sudo apt-get install ffmpeg` on Debian/Ubuntu). Then install `qwen-omni-utils` with the `decord` extra: `pip install qwen-omni-utils[decord] -U`. If `decord` fails to install on your OS (e.g., Windows/macOS), you may need to compile it from source or accept the slower `torchvision` fallback.
- ModuleNotFoundError: No module named 'qwen_omni_utils'
  Cause: The `qwen-omni-utils` package is not installed in the current Python environment.
  Fix: Install it with pip: `pip install qwen-omni-utils`, or `pip install qwen-omni-utils[decord]` for full video support. If you use a virtual environment, activate it before installing.
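The `decord`/`ffmpeg` fixes above can be sanity-checked with a small diagnostic sketch. The try/except fallback mirrors the behavior described above (prefer `decord`, fall back to `torchvision`); it is an illustration, not the library's actual internals:

```python
import shutil

# Check whether ffmpeg is on PATH (needed for video/audio decoding pipelines).
ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg:", ffmpeg_path or "NOT FOUND - install it, e.g. `sudo apt-get install ffmpeg`")

# Prefer decord for video decoding; fall back to torchvision if unavailable.
try:
    import decord  # noqa: F401
    video_backend = "decord"
except ImportError:
    video_backend = "torchvision (slower fallback)"
print("video backend:", video_backend)
```

Running this before loading any model quickly tells you which video path `qwen-omni-utils` will end up using.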
Warnings
- breaking The GitHub repository for `qwen-omni-utils` can be significantly out of sync with the PyPI release. New features or fixes present in the PyPI package might not be reflected in the public GitHub source code for the utility, leading to confusion when reviewing source or contributing.
- gotcha Strict `transformers` library version compatibility is often required for Qwen Omni models. Using an incompatible `transformers` version can lead to `KeyError: 'qwen2_5_omni'` or other model loading failures.
- gotcha `decord` for faster video loading might not install correctly from PyPI on non-Linux systems. If `decord` installation fails, `qwen-omni-utils` will fall back to `torchvision` for video processing, which might be slower.
- gotcha When integrating `qwen-omni-utils` with `vLLM` for inference, users have reported text generation that either cuts off abruptly or enters an infinite repetition loop. This has been linked to differences in how `positions`, `eager`, and `CUDA` parameters are handled within the `Qwen2Attention` module in `vLLM`.
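Given the strict version coupling noted above, a quick check of the installed `transformers` version before loading a model can save debugging time. A minimal sketch; the `(4, 51, 3)` pin here is an assumption for illustration — check the Qwen model card for the actual required version:

```python
import importlib.metadata

# Hypothetical minimum version; the real pin is on the Qwen model card.
REQUIRED = (4, 51, 3)

def parse_version(v: str) -> tuple:
    """Parse the leading numeric components of a version string, e.g. '4.51.3' -> (4, 51, 3)."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

try:
    installed = parse_version(importlib.metadata.version("transformers"))
    if installed < REQUIRED:
        print(f"transformers {installed} < required {REQUIRED}; expect KeyError: 'qwen2_5_omni'")
    else:
        print(f"transformers {installed} meets the assumed minimum {REQUIRED}")
except importlib.metadata.PackageNotFoundError:
    print("transformers is not installed")
```

Tuple comparison handles multi-digit components correctly, where naive string comparison would rank "4.9" above "4.51".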
Install
- pip install qwen-omni-utils -U
- pip install qwen-omni-utils[decord] -U (plus sudo apt-get install ffmpeg for the faster decord video backend)
Imports
- process_mm_info
from qwen_omni_utils import process_mm_info
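`process_mm_info` consumes conversations in the chat-message format used by Qwen Omni: a list of messages, each with a `role` and a `content` list of typed parts. A minimal sketch of that structure (the URLs are placeholders, and no network access is needed to build it):

```python
# Conversation in the Qwen Omni chat format: each message has a role and a
# list of typed content parts (text, image, audio, video).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image", "image": "https://example.com/image.jpg"},  # placeholder URL
            {"type": "audio", "audio": "https://example.com/audio.wav"},  # placeholder URL
        ],
    }
]

# Inspect which modalities are present without touching the network.
modalities = [part["type"] for msg in conversation for part in msg["content"]]
print(modalities)  # ['text', 'image', 'audio']
```

This is the structure you would pass to `process_mm_info` (and to `processor.apply_chat_template`) in the Quickstart below.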
Quickstart
import soundfile as sf
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
import os
# NOTE: Replace with your actual model path or Hugging Face model ID
model_id = "Qwen/Qwen2.5-Omni-7B"
# Ensure you have a Hugging Face token if using private models
# os.environ['HF_TOKEN'] = os.environ.get('HF_TOKEN', 'hf_YOUR_TOKEN_HERE')
# Load model and processor (requires significant GPU memory)
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
# model_id, torch_dtype="auto", device_map="auto"
# )
# processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
# Example usage with process_mm_info (assuming model/processor loaded above).
# process_mm_info extracts audio, image, and video data from a conversation
# so the processor can consume them alongside the chat-template text.
# conversation = [
#     {"role": "user", "content": [
#         {"type": "text", "text": "Describe this image:"},
#         {"type": "image", "image": "https://example.com/image.jpg"},
#         {"type": "text", "text": "And tell me about this audio:"},
#         {"type": "audio", "audio": "https://example.com/audio.wav"},
#     ]}
# ]
# text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
# inputs = processor(text=text, audio=audios, images=images, videos=videos,
#                    return_tensors="pt", padding=True)
print("qwen-omni-utils is successfully imported and ready to process multimodal inputs.")
print("Refer to Qwen model documentation for full model loading and inference examples.")