Coqui TTS
Coqui TTS is a deep learning toolkit for text-to-speech synthesis, providing state-of-the-art models and training utilities. It is actively maintained with frequent releases; the current version is `0.22.0`, which supports Python 3.9 through 3.11. It is used to generate high-quality synthetic speech from text in a range of languages and speaker styles.
Common errors
- `ModuleNotFoundError: No module named 'TTS'`
  - Cause: The `TTS` package is not installed in the current Python environment, or the environment where it was installed is not active.
  - Fix: Install the package with `pip install TTS`, and verify the correct Python environment is activated before running your script.
- `RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU 0; Y GiB total capacity; Z GiB already allocated; W GiB free; P MiB reserved in total by PyTorch)`
  - Cause: The GPU lacks sufficient memory for the current operation, often due to a large model (such as XTTS v2) or a high batch size.
  - Fix: Use a smaller model, reduce the batch size (if applicable), or switch to CPU inference (e.g., `gpu=False`, or move the model with `.to("cpu")`). Ensure no other GPU-intensive applications are running. Consider upgrading your GPU or offloading parts of the model if supported.
- `FileNotFoundError: [Errno 2] No such file or directory: 'espeak-ng'`
  - Cause: The `espeak-ng` command-line tool, an external dependency many TTS models use for phonemization, is not installed or not discoverable on the system's `PATH`.
  - Fix: Install `espeak-ng` for your operating system. Debian/Ubuntu: `sudo apt-get install espeak-ng`. macOS: `brew install espeak-ng`.
- `AttributeError: 'TTS' object has no attribute 'speakers'`
  - Cause: You are accessing multi-speaker attributes (such as `speakers` or `languages`) on a `TTS` instance initialized with a single-speaker model, or a model that does not expose these properties directly.
  - Fix: Check the capabilities of the loaded model. Single-speaker models do not provide these attributes. For multi-speaker models like XTTS v2, pass `speaker_wav` and `language` as arguments to `tts_to_file`.
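The external-tool failures above can be caught before loading any model. A minimal, stdlib-only sketch that checks the `PATH` for the tools these errors mention:

```python
import shutil

def missing_tools(tools=("espeak-ng", "ffmpeg")):
    """Return the subset of `tools` not found on the system PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Prints e.g. ['espeak-ng'] if the phonemizer backend is absent.
print(missing_tools())
```

Running this at startup gives a clearer message than the `FileNotFoundError` raised mid-synthesis.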
Warnings
- breaking The primary API for model inference shifted significantly around versions 0.20.0-0.21.0. Older approaches that involved directly importing and instantiating model classes (e.g., `from TTS.vocoder.models.wavernn import WaveRNN`) are largely superseded by the unified `TTS` class from `TTS.api`. While some direct imports might still function, the recommended and supported way to load and use models is via `TTS.api.TTS(model_name='...')`.
- gotcha Models like XTTS v2 are highly resource-intensive, requiring substantial GPU VRAM (e.g., 10GB+) and system RAM (16GB+). Running these models on CPU or under-resourced GPUs can lead to `CUDA out of memory` errors or extremely slow inference speeds.
- gotcha Many multilingual and advanced TTS models rely on external system-level dependencies like `espeak-ng` and `ffmpeg` for phonemization and audio processing. These are not installed by `pip` and must be manually installed on your operating system.
- gotcha Achieving GPU acceleration requires careful management of `torch`, `torchaudio`, and CUDA toolkit versions. Installing `tts` via `pip` usually pulls in CPU versions of `torch` and `torchaudio` if GPU-enabled versions are not pre-installed. Mismatched versions can lead to `CUDA not available` or runtime errors.
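Given the version pitfalls above, it helps to probe the environment rather than assume CUDA works. A small sketch that degrades gracefully when `torch` is missing or CPU-only:

```python
def pick_device():
    """Return "cuda" if a CUDA-enabled torch build sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed at all
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

If this prints `cpu` on a machine with a GPU, the installed `torch` wheel is likely a CPU-only build (see the Install section for the CUDA index URL).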
Install
- CPU-only: `pip install TTS`
- GPU (CUDA 11.8): `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 && pip install TTS`
Imports
- TTS: `from TTS.api import TTS`
Quickstart
```python
import torch
from TTS.api import TTS

# Determine device (CUDA if available, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

try:
    # Initialize TTS with a common English model (downloads on first use),
    # then move it to the chosen device.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC").to(device)

    # Generate speech and save to file
    text_to_synthesize = "Hello, this is a test from the Coqui TTS library."
    output_filepath = "output_audio.wav"
    tts.tts_to_file(text=text_to_synthesize, file_path=output_filepath)
    print(f"Speech synthesized to {output_filepath}")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure you have installed TTS and its dependencies correctly.")
    print("For GPU support, install torch with CUDA first "
          "(e.g., pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118)")
```
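For multi-speaker models, the call differs from the quickstart: XTTS v2 requires `speaker_wav` and `language` (see the `AttributeError` entry under Common errors). A hedged sketch, assuming the `TTS` package is installed and `"speaker_sample.wav"` stands in for a short, clean recording of the target voice; the model weights (several GB) download on first use:

```python
try:
    from TTS.api import TTS
except ImportError:
    TTS = None  # package missing; see the Install section

def clone_voice(text, speaker_wav, out_path, language="en"):
    """Synthesize `text` in the voice of `speaker_wav` and write a WAV file."""
    if TTS is None:
        raise RuntimeError("TTS is not installed; run: pip install TTS")
    tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
    # XTTS v2 is multi-speaker and multilingual, so both `speaker_wav`
    # and `language` must be passed to tts_to_file.
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return out_path

# Example (downloads the model on first run):
# clone_voice("Hello from a cloned voice.", "speaker_sample.wav", "cloned.wav")
```

Keep the reference clip short (a few seconds) and free of background noise for best cloning quality.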