Qwen-TTS
Qwen-TTS is a powerful text-to-speech (TTS) synthesis library developed by the Qwen team (Alibaba Cloud). It enables high-quality speech generation from text, supporting various languages and speaking styles. The library is currently at version 0.1.1 and is under active development, with updates typically coinciding with major model releases or feature improvements.
Common errors
- `RuntimeError: No available GPU(s) found`
  - cause: The model was loaded on a CUDA device, but no compatible GPU with a correctly configured PyTorch/CUDA environment was detected.
  - fix: Explicitly set `device='cpu'` when loading the model: `model = QwenTTS.from_pretrained('Qwen/Qwen3-TTS', device='cpu')`. If you intend to use a GPU, make sure your PyTorch build matches your CUDA toolkit version.
- `OSError: Can't load tokenizer for 'Qwen/Qwen3-TTS'. If you were trying to load it from 'https://huggingface.co/Qwen/Qwen3-TTS', make sure you don't have a local directory with the same name.`
  - cause: The model or tokenizer files could not be downloaded or found locally. This usually indicates a network problem, a mistyped model ID, or corrupted cached files.
  - fix: Verify your internet connection and that you can reach the Hugging Face Hub, and double-check the model ID for typos. If the issue persists, clear the Hugging Face cache: `rm -rf ~/.cache/huggingface/`.
- `ModuleNotFoundError: No module named 'qwen_tts'`
  - cause: The `qwen-tts` package is not installed in the current Python environment.
  - fix: Install it with pip: `pip install qwen-tts`.
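Before digging into model code after a GPU-related error, it can help to print what PyTorch reports about its own CUDA build. The helper below is a small sketch that uses only documented `torch` attributes and degrades gracefully when PyTorch is not installed at all:

```python
def cuda_report():
    """Return a one-line description of the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        # torch.version.cuda is None for CPU-only builds of PyTorch
        return f"torch {torch.__version__} (CUDA build: {torch.version.cuda}), no usable GPU"
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"{torch.cuda.device_count()} GPU(s): {torch.cuda.get_device_name(0)}"
    )

print(cuda_report())
```

If this prints a CUDA build version but reports no usable GPU, the mismatch is between PyTorch's CUDA build and your driver/toolkit, not in Qwen-TTS itself.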
Warnings
- gotcha: Qwen-TTS depends on PyTorch and other deep-learning packages. Correct installation, especially for GPU (CUDA) acceleration, is crucial: mismatched CUDA versions between your system, PyTorch, and other libraries can cause runtime errors or poor performance.
- gotcha: `QwenTTS.from_pretrained()` downloads model weights from the Hugging Face Hub. This requires an active internet connection and significant disk space (several GB for the model); slow or flaky connections can cause downloads to stall or fail.
- gotcha: `frontend.get_text_token_and_style_token()` requires valid `language` and `style_name` parameters. Unsupported languages or style names (e.g., 'happy' for a language that only supports 'neutral') will raise errors.
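One way to sidestep the unsupported-style error is to validate the requested style against a per-language table before calling the frontend. The table below is purely illustrative (the actual supported styles come from the model card, not from this document); the fallback logic itself is plain Python:

```python
# Hypothetical style table for illustration; consult the model card for real values.
SUPPORTED_STYLES = {
    "en": {"neutral", "happy", "sad"},
    "zh": {"neutral"},
}

def resolve_style(language, style_name, default="neutral"):
    """Return style_name if supported for the language, else fall back to default."""
    styles = SUPPORTED_STYLES.get(language)
    if styles is None:
        raise ValueError(f"unsupported language: {language!r}")
    return style_name if style_name in styles else default

print(resolve_style("zh", "happy"))  # falls back to 'neutral'
```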
Install
- `pip install qwen-tts`
- `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
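Because the first `from_pretrained()` call pulls several GB into the Hugging Face cache, it is worth checking where that cache lives and how much space is free before the download starts. This standard-library sketch assumes the default cache location and the documented `HF_HOME` override:

```python
import os
import shutil
from pathlib import Path

# Hugging Face honors HF_HOME; the default cache lives under ~/.cache/huggingface
hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))

# Free space on the filesystem that will hold the download
probe = hf_home if hf_home.exists() else Path.home()
free_gb = shutil.disk_usage(probe).free / 1e9

print(f"cache dir: {hf_home}")
print(f"free space: {free_gb:.1f} GB")
```

Setting `HF_HOME` before the first download relocates the cache, e.g. to a larger data disk.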
Imports
- QwenTTS
from qwen_tts import QwenTTS
from qwen_tts.models import QwenTTS
- get_frontend
from qwen_tts.utils import get_frontend
from qwen_tts.frontend import get_frontend
Quickstart
import torch
import soundfile as sf
from qwen_tts.frontend import get_frontend
from qwen_tts.models import QwenTTS
# Define text and style for synthesis
text = "Hello, this is a test from Qwen TTS, demonstrating speech synthesis."
language = "en"
style_name = "neutral" # Other options: 'happy', 'sad', etc.
# Determine device for model loading (GPU if available, else CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Attempting to load model on: {device}")
# Load the QwenTTS model from Hugging Face Hub
try:
    model = QwenTTS.from_pretrained('Qwen/Qwen3-TTS', device=device)
except Exception as e:
    print(f"Failed to load model on {device}: {e}. Retrying with 'cpu'.")
    device = 'cpu'
    model = QwenTTS.from_pretrained('Qwen/Qwen3-TTS', device=device)
# Initialize the frontend for text processing
# The exp_name is retrieved from the loaded model's hyperparameters
frontend = get_frontend(model.hparams.data.exp_name)
# Get text and style tokens from the frontend
text_token, style_token = frontend.get_text_token_and_style_token(
    text=text,
    language=language,
    style_name=style_name
)
# Synthesize speech using the model
output = model.synthesize(text_token, style_token)
wav = output['wav'][0].cpu().numpy() # Extract waveform and move to CPU
sampling_rate = model.hparams.data.sampling_rate
# Save the synthesized audio to a WAV file
output_filename = "qwen_tts_output.wav"
sf.write(output_filename, wav, sampling_rate)
print(f"Speech synthesized and saved to {output_filename}")
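If playback isn't handy, a saved file can be sanity-checked with the standard-library `wave` module, independently of `soundfile`. The snippet below is a self-contained sketch: it writes a short synthetic tone (standing in for the model's output; the 24 kHz rate is an assumption, in practice read `model.hparams.data.sampling_rate`) and then reads the header back to confirm the sample rate and duration.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed rate for this sketch; use the model's real sampling rate
DURATION_S = 0.5

# Write a 440 Hz sine tone as 16-bit mono WAV (stand-in for synthesized audio).
frames = b"".join(
    struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
    for n in range(int(SAMPLE_RATE * DURATION_S))
)
with wave.open("check_me.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(frames)

# Read the header back and verify the file matches expectations.
with wave.open("check_me.wav", "rb") as r:
    rate = r.getframerate()
    duration = r.getnframes() / rate
print(rate, round(duration, 2))
```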