OmniVoice

0.1.5 verified Sat May 09 auth: no python

OmniVoice is a zero-shot text-to-speech library using diffusion language models. It supports multilingual TTS with voice cloning from short audio samples. Current version 0.1.5, actively maintained. Requires Python >= 3.10.

pip install omnivoice
error RuntimeError: Audio length mismatch
cause The reference audio and reference text do not align, or the audio is longer than the recommended 30-second maximum.
fix
Trim the reference audio to 3-30 seconds and make sure the reference text is an exact transcript of what is spoken in the clip.
error AttributeError: module 'torchaudio' has no attribute 'resample'
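cause torchaudio has no top-level resample function; resampling lives in torchaudio.functional, and very old torchaudio releases (<0.12) may not provide it.
fix
Install torchaudio >= 0.12 (pip install --upgrade torchaudio) and call torchaudio.functional.resample.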
error ImportError: cannot import name 'OmniVoice' from 'omnivoice'
cause Incorrect import path; older documentation showed wrong path.
fix
Use 'from omnivoice import OmniVoice' instead of 'from omnivoice.model import OmniVoice'.
error ValueError: The truth value of an array with more than one element is ambiguous
cause Stereo audio was passed as the reference; the model expects mono.
fix
Convert the reference audio to mono before passing it, for example by averaging the channels of the loaded tensor: waveform.mean(dim=0, keepdim=True).
error FileNotFoundError: No such file or directory: 'path/to/model'
cause Model not downloaded or cache path misconfigured.
fix
Ensure internet connection for first download, or set OMNIVOICE_CACHE_DIR to a valid path.
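One way to apply the fix above is to set OMNIVOICE_CACHE_DIR from Python before the library is imported. This is a minimal sketch: `configure_cache_dir` is a hypothetical helper, the directory is a placeholder, and whether omnivoice reads the variable at import time or at load time is an assumption, which is why the variable is set first.

```python
import os
from pathlib import Path


def configure_cache_dir(path: str) -> Path:
    """Create the cache directory if needed and export OMNIVOICE_CACHE_DIR.

    Hypothetical helper, not part of omnivoice itself.
    """
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: omnivoice picks this variable up when it loads a model.
    os.environ["OMNIVOICE_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

Calling `configure_cache_dir("~/.cache/omnivoice")` (placeholder path) at the top of a script, before `import omnivoice`, keeps the setting in one place.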
breaking Loading a model without internet access fails if the cache is empty. Pass a local pretrained path explicitly in that case.
fix Set OMNIVOICE_CACHE_DIR or download model files manually.
gotcha Reference audio must be monophonic and at 24kHz sample rate. Mismatch causes quality degradation.
fix Resample audio to 24000 Hz and convert to mono before passing.
gotcha Inference on MPS (Apple Silicon) may fail due to unsupported operations. Use CPU or CUDA.
fix Set device='cpu' explicitly on Apple Silicon machines instead of relying on MPS.
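A small sketch of a device choice that follows this advice: prefer CUDA when present, otherwise fall back to CPU, and never select MPS. `pick_device` is a hypothetical helper; where the resulting string is passed to OmniVoice is not shown here because the source does not specify it.

```python
def pick_device() -> str:
    """Return 'cuda' if a CUDA device is usable, otherwise 'cpu'.

    MPS is deliberately never returned, per the gotcha above.
    Hypothetical helper, not part of omnivoice itself.
    """
    try:
        import torch
    except ImportError:
        # Without torch nothing can run on a GPU anyway.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```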
deprecated The `load_asr` argument in model loading is deprecated. ASR model is now loaded automatically.
fix Remove `load_asr=True` from `OmniVoice.from_pretrained`.

Basic TTS inference with voice cloning.

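import torchaudio
from omnivoice import OmniVoice, infer

# Load the pretrained model (downloaded on first use)
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice")

# Synthesize speech in the voice of ref.wav; reference_text should be the transcript of ref.wav
audio = infer(model, text="Hello world", reference_audio="ref.wav", reference_text="The quick brown fox")

# Save to file: torchaudio.save expects a (channels, samples) tensor,
# so add a channel dimension; 24000 is the output sample rate in Hz
torchaudio.save("output.wav", audio.unsqueeze(0), 24000)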
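Several of the errors listed above come from the reference clip itself, so it may be worth validating ref.wav before calling infer. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; `check_reference_wav` is a hypothetical helper, and the thresholds are the 3-30 second range and 24 kHz mono format stated earlier on this page.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 30.0  # range from the "Audio length mismatch" fix
EXPECTED_RATE = 24_000                # from the sample-rate gotcha


def check_reference_wav(path: str) -> list[str]:
    """Return a list of problems found in a PCM WAV reference clip.

    An empty list means the clip passed all three checks.
    Hypothetical helper, not part of omnivoice itself.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != EXPECTED_RATE:
        problems.append(f"expected {EXPECTED_RATE} Hz, found {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration {duration:.1f}s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
        )
    return problems
```

Running `check_reference_wav("ref.wav")` and printing any returned messages before inference gives a clearer diagnosis than the runtime errors it helps avoid.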