Qwen-TTS

0.1.1 · active · verified Thu Apr 16

Qwen-TTS is a text-to-speech (TTS) synthesis library from the Qwen team at Alibaba Cloud. It generates high-quality speech from text and supports multiple languages and speaking styles. The library is currently at version 0.1.1 and under active development; updates typically coincide with major model releases or feature improvements.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to load the Qwen-TTS model, prepare text with its frontend, synthesize speech, and save the output to a WAV file. It includes robust device selection (GPU/CPU) and handles common initialization steps.

import torch
import soundfile as sf
from qwen_tts.frontend import get_frontend
from qwen_tts.models import QwenTTS

# Define text and style for synthesis
text = "Hello, this is a test from Qwen TTS, demonstrating speech synthesis."
language = "en"
style_name = "neutral" # Other options: 'happy', 'sad', etc.

# Determine device for model loading (GPU if available, else CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Attempting to load model on: {device}")
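The device-selection step above can be factored into a small helper. This is a minimal, framework-free sketch of the same logic; the `cuda_available` flag stands in for `torch.cuda.is_available()` so the snippet runs without torch or a GPU:

```python
def select_device(cuda_available: bool) -> str:
    """Return 'cuda' when a GPU is available, otherwise fall back to 'cpu'."""
    return 'cuda' if cuda_available else 'cpu'

# Mirrors the torch-based check in the quickstart.
print(select_device(True))   # 'cuda'
print(select_device(False))  # 'cpu'
```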

# Load the QwenTTS model from Hugging Face Hub
try:
    model = QwenTTS.from_pretrained('Qwen/Qwen3-TTS', device=device)
except Exception as e:
    print(f"Failed to load model on {device}: {e}. Retrying with 'cpu'.")
    device = 'cpu'
    model = QwenTTS.from_pretrained('Qwen/Qwen3-TTS', device=device)
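The try/except block above is an instance of a general load-with-fallback pattern: attempt initialization on the preferred device, and retry on CPU if it fails (for example, on a CUDA out-of-memory error). A self-contained sketch of that pattern, with a stub loader standing in for `QwenTTS.from_pretrained`:

```python
def load_with_fallback(loader, preferred_device: str, fallback_device: str = 'cpu'):
    """Try loading on the preferred device; retry on the fallback on any failure."""
    try:
        return loader(device=preferred_device), preferred_device
    except Exception as exc:
        print(f"Failed on {preferred_device}: {exc}. Retrying on {fallback_device}.")
        return loader(device=fallback_device), fallback_device

# Stub loader simulating a GPU initialization failure.
def stub_loader(device: str):
    if device == 'cuda':
        raise RuntimeError("CUDA out of memory")
    return f"model-on-{device}"

model, device = load_with_fallback(stub_loader, 'cuda')
print(model, device)  # model-on-cpu cpu
```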

# Initialize the frontend for text processing
# The exp_name is retrieved from the loaded model's hyperparameters
frontend = get_frontend(model.hparams.data.exp_name)

# Get text and style tokens from the frontend
text_token, style_token = frontend.get_text_token_and_style_token(
    text=text,
    language=language,
    style_name=style_name
)

# Synthesize speech using the model
output = model.synthesize(text_token, style_token)
wav = output['wav'][0].cpu().numpy() # Extract waveform and move to CPU
sampling_rate = model.hparams.data.sampling_rate

# Save the synthesized audio to a WAV file
output_filename = "qwen_tts_output.wav"
sf.write(output_filename, wav, sampling_rate)
print(f"Speech synthesized and saved to {output_filename}")
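The quickstart saves audio with soundfile; if that dependency is unavailable, the standard-library wave module can write 16-bit PCM directly. The sketch below writes one second of a 440 Hz sine tone; the synthetic waveform and the 24 kHz rate are placeholders (in practice you would quantize the model's waveform and use `model.hparams.data.sampling_rate`):

```python
import math
import struct
import wave

sampling_rate = 24000  # assumed rate; real code should use the model's configured rate
duration_s = 1.0
frequency_hz = 440.0

# Generate a sine wave and quantize it to 16-bit signed integers.
n_samples = int(sampling_rate * duration_s)
samples = [
    int(32767 * math.sin(2 * math.pi * frequency_hz * i / sampling_rate))
    for i in range(n_samples)
]

with wave.open("sine_output.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(sampling_rate)
    wf.writeframes(struct.pack(f"<{n_samples}h", *samples))
```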
