{"id":7851,"library":"vocos","title":"Vocos","description":"Vocos is a fast neural vocoder for high-quality audio synthesis that generates Fourier spectral coefficients instead of directly modeling time-domain waveforms. It supports reconstruction from Mel spectrograms or EnCodec tokens, offering improved computational efficiency and audio quality compared to traditional time-domain methods. The library is currently at version 0.1.0 and is actively maintained with regular updates.","status":"active","version":"0.1.0","language":"en","source_language":"en","source_url":"https://github.com/charactr-platform/vocos","tags":["audio","vocoder","synthesis","neural network","speech","deep learning","transformers","text-to-speech","AI","generative AI"],"install":[{"cmd":"pip install vocos","lang":"bash","label":"For inference only"},{"cmd":"pip install vocos[train]","lang":"bash","label":"For training with additional dependencies"}],"dependencies":[{"reason":"Required Python version","package":"python","version":">=3.9","optional":false},{"reason":"Core deep learning framework","package":"torch","optional":false},{"reason":"Audio I/O and processing, especially for training","package":"torchaudio","optional":false},{"reason":"Tensor manipulations","package":"einops","optional":false},{"reason":"For EnCodec token processing","package":"encodec","optional":false},{"reason":"For loading pre-trained models","package":"huggingface-hub","optional":false},{"reason":"Numerical operations","package":"numpy","optional":false},{"reason":"Configuration handling","package":"pyyaml","optional":false},{"reason":"Scientific computing utilities","package":"scipy","optional":false}],"imports":[{"symbol":"Vocos","correct":"from vocos import Vocos"}],"quickstart":{"code":"import torch\nfrom vocos import Vocos\n\n# Instantiate Vocos model from Hugging Face Hub (Mel-spectrogram variant)\nvocos_model = Vocos.from_pretrained(\"charactr/vocos-mel-24khz\")\n\n# Create a dummy Mel-spectrogram tensor 
# (Batch, Channels, Time)\n# In a real application, this would come from a feature extraction step on actual audio.\n# Example: 1 batch, 100 mel bands, 256 time frames\nmel_spectrogram = torch.randn(1, 100, 256)\n\n# Move model and input to GPU if available\nif torch.cuda.is_available():\n    vocos_model = vocos_model.to('cuda')\n    mel_spectrogram = mel_spectrogram.to('cuda')\n\n# Decode the Mel-spectrogram to an audio waveform\naudio_waveform = vocos_model.decode(mel_spectrogram)\n\nprint(f\"Generated audio waveform shape: {audio_waveform.shape}\")\nprint(f\"Generated audio on device: {audio_waveform.device}\")\n# To save the audio, you would typically use torchaudio.save:\n# import torchaudio\n# torchaudio.save(\"generated_audio.wav\", audio_waveform.cpu(), sample_rate=24000)","lang":"python","description":"This quickstart demonstrates how to load a pre-trained Vocos model and synthesize an audio waveform from a dummy Mel-spectrogram. It shows basic model instantiation, input preparation, and the decoding process."},"warnings":[{"fix":"Review the latest release notes and Hugging Face model cards for updated training configurations and hyperparameters. Consider re-evaluating your training scripts.","message":"Version 0.1.0 introduced a new multi-resolution (+multi-band) discriminator and updated recommended hyperparameters for the AdamW optimizer (lr=5e-4, betas=(0.8, 0.9)). Pre-trained models on Hugging Face were also updated. If you're fine-tuning or training a model based on earlier versions, you may need to adjust your training setup and hyperparameters for optimal results.","severity":"breaking","affected_versions":">=0.1.0"},{"fix":"Ensure your `torchaudio` installation is compatible with your environment. 
For specific issues, consult the `torchaudio` documentation or consider using a Linux-based environment for training if encountering persistent problems.","message":"When training Vocos, especially on Windows, users have reported incompatibilities or difficulties with `torchaudio` if it requires building with `sox`. This can lead to issues with the `vocos[train]` installation.","severity":"gotcha","affected_versions":"*"},{"fix":"Convert the tokens with `codes_to_features`, then pass `bandwidth_id` to `decode` as an index tensor into `[1.5, 3.0, 6.0, 12.0]` kbps, for example `torch.tensor([2])` for 6.0 kbps.","message":"When reconstructing audio from EnCodec tokens, you must provide a `bandwidth_id` parameter. It is an index (0 to 3) into the supported bandwidths `[1.5, 3.0, 6.0, 12.0]` kbps, not the kbps value itself. An out-of-range index, or one that does not match the bandwidth used to produce the tokens, can lead to errors or degraded audio.","severity":"gotcha","affected_versions":"*"},{"fix":"Be aware of the model's intended use. If targeting non-speech audio, conduct thorough evaluation of the output quality.","message":"Vocos models are primarily trained for speech synthesis. Using them for other audio domains (e.g., music, general sound effects) might result in lower quality outputs compared to their performance on speech.","severity":"gotcha","affected_versions":"*"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the library using `pip install vocos` or ensure the correct virtual environment where `vocos` is installed is activated.","cause":"The `vocos` library is not installed in the active Python environment or is not in the Python path.","error":"ModuleNotFoundError: No module named 'vocos'"},{"fix":"Ensure both the model and all input tensors are on the same device. 
For example, move them to GPU with `.to('cuda')` or to CPU with `.to('cpu')`.","cause":"The input tensor (e.g., Mel spectrogram) is on the CPU, while the Vocos model is on the GPU (or vice-versa), leading to a device mismatch in PyTorch operations.","error":"RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) don't match"},{"fix":"Pass `bandwidth_id` as an index tensor in the range 0-3, for example `torch.tensor([2])` for 6.0 kbps. The indices map to the bandwidths `[1.5, 3.0, 6.0, 12.0]` kbps; do not pass the kbps value itself.","cause":"An invalid `bandwidth_id` was provided when using an EnCodec-based Vocos model. The parameter expects an index into the list of supported bandwidths.","error":"ValueError: 'bandwidth_id' must be one of [0, 1, 2, 3] (corresponding to [1.5, 3.0, 6.0, 12.0] kbps)"},{"fix":"Experiment with the `mrd_loss_coeff` hyperparameter. Setting it to `1.0` from the start of training can help reduce 'buzziness' in the audio output, although it might slightly slow down convergence in terms of UTMOS score.","cause":"This is often related to the weighting of the multi-resolution discriminator (MRD) loss during training.","error":"Generated audio sounds buzzy or contains artifacts during training."}]}