Vocos
Vocos is a fast neural vocoder for high-quality audio synthesis that generates Fourier spectral coefficients instead of directly modeling time-domain waveforms. It supports reconstruction from Mel spectrograms or EnCodec tokens, offering improved computational efficiency and audio quality compared to traditional time-domain methods. The library is currently at version 0.1.0 and is actively maintained with regular updates.
Common errors
- `ModuleNotFoundError: No module named 'vocos'`
  - Cause: the `vocos` library is not installed in the active Python environment, or is not on the Python path.
  - Fix: install it with `pip install vocos`, or activate the virtual environment in which `vocos` is installed.
- `RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) don't match`
  - Cause: the input tensor (e.g., a Mel spectrogram) is on the CPU while the Vocos model is on the GPU (or vice versa), producing a device mismatch in PyTorch operations.
  - Fix: keep the model and all input tensors on the same device, e.g. move both to the GPU with `.to('cuda')` or to the CPU with `.to('cpu')`.
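The fix can be sketched as a small helper that moves inputs to whatever device the model's parameters live on. The helper name is hypothetical, not part of the Vocos API; it works for any `nn.Module`, including a loaded Vocos model:

```python
import torch

def to_model_device(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Move `x` to the device of `model`'s parameters (hypothetical helper)."""
    device = next(model.parameters()).device
    return x.to(device)

model = torch.nn.Linear(100, 100)   # stand-in for a real vocoder model
mel = torch.randn(1, 100, 256)
mel = to_model_device(model, mel)   # now guaranteed to match the model's device
```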
- `ValueError: 'bandwidth_id' must be one of [0, 1, 2, 3] (corresponding to [1.5, 3.0, 6.0, 12.0] kbps)`
  - Cause: an invalid `bandwidth_id` was passed to an EnCodec-based Vocos model; the parameter expects an index corresponding to one of the supported bandwidths.
  - Fix: check the model's documentation or source code for the exact mapping. Typically `bandwidth_id` is an index (0-3) into the allowed bandwidths `[1.5, 3.0, 6.0, 12.0]` kbps.
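The mapping can be captured in a tiny helper. This is a hypothetical convenience function for illustration only, not part of the Vocos API; check the model's own configuration for the authoritative list:

```python
# Bandwidths supported by the EnCodec-based Vocos models, in kbps.
BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]

def bandwidth_to_id(kbps: float) -> int:
    """Return the bandwidth_id index for a target bitrate (hypothetical helper)."""
    if kbps not in BANDWIDTHS_KBPS:
        raise ValueError(
            f"'bandwidth_id' must correspond to one of {BANDWIDTHS_KBPS} kbps"
        )
    return BANDWIDTHS_KBPS.index(kbps)

print(bandwidth_to_id(6.0))  # index 2 selects 6.0 kbps
```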
- Generated audio sounds buzzy or contains artifacts during training.
  - Cause: this is often related to the weighting of the multi-resolution discriminator (MRD) loss during training.
  - Fix: experiment with the `mrd_loss_coeff` hyperparameter. Setting it to `1.0` from the start of training can reduce buzziness in the output, though it may slightly slow convergence in terms of UTMOS score.
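In the Vocos training setup this coefficient lives in a YAML config. A sketch of the relevant fragment, assuming a config layout like the ones shipped in the Vocos repository (the exact nesting may differ between config files, so treat the key path as an assumption):

```yaml
model:
  init_args:
    mrd_loss_coeff: 1.0  # raised from the default to reduce buzziness
```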
Warnings
- breaking Version 0.1.0 introduced a new multi-resolution (+multi-band) discriminator and updated recommended hyperparameters for the AdamW optimizer (lr=5e-4, betas=(0.8, 0.9)). Pre-trained models on Hugging Face were also updated. If you're fine-tuning or training a model based on earlier versions, you may need to adjust your training setup and hyperparameters for optimal results.
- gotcha When training Vocos, especially on Windows, users have reported problems with `torchaudio` when it has to be built with `sox` support, which can break the `vocos[train]` installation.
- gotcha When reconstructing audio from EnCodec tokens, you must pass a `bandwidth_id` parameter. It is an index into the supported bandwidths (kbps) `[1.5, 3.0, 6.0, 12.0]`; any other value results in incorrect behavior or errors.
- gotcha Vocos models are primarily trained for speech synthesis. Using them for other audio domains (e.g., music, general sound effects) might result in lower quality outputs compared to their performance on speech.
Install
- `pip install vocos`
- `pip install "vocos[train]"` (for training; quote the extras so shells such as zsh don't expand the square brackets)
Imports
- `Vocos`: `from vocos import Vocos`
Quickstart
```python
import torch
from vocos import Vocos

# Instantiate a Vocos model from the Hugging Face Hub (Mel-spectrogram variant)
vocos_model = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Create a dummy Mel spectrogram tensor of shape (batch, mel bands, time frames).
# In a real application this would come from a feature-extraction step on actual audio.
mel_spectrogram = torch.randn(1, 100, 256)  # 1 batch, 100 mel bands, 256 frames

# Move the model and input to the GPU if one is available
if torch.cuda.is_available():
    vocos_model = vocos_model.to('cuda')
    mel_spectrogram = mel_spectrogram.to('cuda')

# Decode the Mel spectrogram to an audio waveform
audio_waveform = vocos_model.decode(mel_spectrogram)
print(f"Generated audio waveform shape: {audio_waveform.shape}")
print(f"Generated audio on device: {audio_waveform.device}")

# To save the audio, you would typically use torchaudio.save:
# import torchaudio
# torchaudio.save("generated_audio.wav", audio_waveform.cpu(), sample_rate=24000)
```