Qwen Omni Language Model Utilities

0.0.9 · active · verified Thu Apr 16

Qwen Omni Language Model Utils (`qwen-omni-utils`) is a Python library providing a toolkit for conveniently handling the audio, image, and video inputs accepted by Qwen Omni multimodal models. It simplifies processing media supplied as base64 data, URLs, or local files, interleaved with text, and prepares them in the form the model's processor expects. The library is actively maintained by the Qwen team as part of their multimodal large language model ecosystem.
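For orientation, inputs to the Qwen Omni utilities are expressed as chat-style messages whose content lists interleave text with media references. The sketch below only illustrates the shapes involved; the URLs, file path, and base64 prefix are placeholders, not real resources:

```python
# Chat-style conversation: each message has a role and a list of content items.
# Media can be referenced by remote URL, local file, or inline base64 data
# (all values below are placeholders).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this clip?"},
            {"type": "video", "video": "https://example.com/clip.mp4"},  # remote URL
            {"type": "audio", "audio": "file:///tmp/narration.wav"},     # local file
            {"type": "image", "image": "data:image/jpeg;base64,..."},    # inline base64
        ],
    }
]

# The media items stay interleaved with text, in order:
media_types = [
    item["type"]
    for msg in conversation
    for item in msg["content"]
    if item["type"] != "text"
]
print(media_types)  # → ['video', 'audio', 'video' is not repeated: ['video', 'audio', 'image']
```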

Quickstart

This quickstart demonstrates importing `process_mm_info` from `qwen_omni_utils`. It shows how model and processor loading would typically be done, but the actual model loading and inference steps are commented out because of their resource requirements. The `process_mm_info` function is the key entry point for preparing diverse multimodal inputs for Qwen Omni models. Ensure you have `ffmpeg` installed for full video support and a compatible `transformers` version.

import soundfile as sf
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
import os

# NOTE: Replace with your actual model path or Hugging Face model ID
model_id = "Qwen/Qwen2.5-Omni-7B"

# Ensure you have a Hugging Face token if using private models
# os.environ['HF_TOKEN'] = os.environ.get('HF_TOKEN', 'hf_YOUR_TOKEN_HERE') 

# Load model and processor (requires significant GPU memory)
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     model_id, torch_dtype="auto", device_map="auto"
# )
# processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# Example usage with process_mm_info (assuming model/processor loaded above).
# process_mm_info takes a chat-style conversation and returns the extracted
# audio, image, and video inputs for the processor.
# conversation = [
#     {"role": "user", "content": [
#         {"type": "text", "text": "Describe this image:"},
#         {"type": "image", "image": "https://example.com/image.jpg"},
#         {"type": "text", "text": "And tell me about this audio:"},
#         {"type": "audio", "audio": "https://example.com/audio.wav"},
#     ]}
# ]
# text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
# inputs = processor(text=text, audio=audios, images=images, videos=videos,
#                    return_tensors="pt", padding=True)

print("qwen-omni-utils is successfully imported and ready to process multimodal inputs.")
print("Refer to Qwen model documentation for full model loading and inference examples.")
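Conceptually, `process_mm_info` walks the conversation and separates media references into audio, image, and video lists for the processor. The toy stand-in below (a hypothetical helper, not part of the library) only illustrates that bucketing; the real function additionally fetches URLs, decodes base64, and loads audio samples and video frames:

```python
def split_mm_items(conversation):
    """Toy stand-in: bucket multimodal references by type.

    Unlike the real process_mm_info, this performs no downloading
    or decoding; it only sorts the references.
    """
    audios, images, videos = [], [], []
    for message in conversation:
        for item in message.get("content", []):
            if item["type"] == "audio":
                audios.append(item["audio"])
            elif item["type"] == "image":
                images.append(item["image"])
            elif item["type"] == "video":
                videos.append(item["video"])
    return audios, images, videos

conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "Compare these:"},
        {"type": "image", "image": "https://example.com/a.jpg"},
        {"type": "image", "image": "https://example.com/b.jpg"},
        {"type": "audio", "audio": "https://example.com/a.wav"},
    ]}
]

audios, images, videos = split_mm_items(conversation)
print(len(audios), len(images), len(videos))  # → 1 2 0
```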
