Vocos

0.1.0 · active · verified Thu Apr 16

Vocos is a fast neural vocoder for high-quality audio synthesis. Instead of modeling time-domain waveforms directly, it predicts Fourier spectral coefficients and reconstructs audio via an inverse STFT, which improves computational efficiency and audio quality over traditional time-domain approaches. Pre-trained checkpoints support reconstruction from Mel spectrograms or from EnCodec tokens.
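To make the "Fourier coefficients instead of waveforms" idea concrete, here is a minimal sketch of the final synthesis step such a vocoder relies on: inverting a complex spectrogram with `torch.istft`. The spectrogram here is random noise standing in for model output; the STFT parameters (n_fft=1024, hop=256) are illustrative assumptions, not Vocos's exact configuration.

```python
import torch

# A vocoder with an iSTFT head predicts a complex spectrogram and inverts it
# in one shot, rather than generating samples autoregressively. Random noise
# stands in for the model's predicted coefficients here.
n_fft, hop, frames = 1024, 256, 100

# (batch, freq bins, time frames) complex spectrogram
spec = torch.randn(1, n_fft // 2 + 1, frames, dtype=torch.complex64)
window = torch.hann_window(n_fft)

# Inverse STFT: the entire waveform is produced by one deterministic transform.
waveform = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)
print(waveform.shape)  # (1, hop * (frames - 1)) with the default center=True
```

Because the network only has to predict frame-level coefficients, synthesis cost scales with the number of frames rather than the number of audio samples.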

Common errors

Warnings

Install

pip install vocos

Imports

import torch
from vocos import Vocos

Quickstart

This quickstart demonstrates how to load a pre-trained Vocos model and synthesize an audio waveform from a dummy Mel-spectrogram. It shows basic model instantiation, input preparation, and the decoding process.

import torch
from vocos import Vocos

# Instantiate Vocos model from Hugging Face Hub (Mel-spectrogram variant)
vocos_model = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Create a dummy Mel-spectrogram tensor with shape (batch, n_mels, time frames)
# In a real application, this would come from a feature extraction step on actual audio.
# Example: batch of 1, 100 mel bands (as expected by vocos-mel-24khz), 256 time frames
mel_spectrogram = torch.randn(1, 100, 256)

# Move model and input to GPU if available
if torch.cuda.is_available():
    vocos_model = vocos_model.to('cuda')
    mel_spectrogram = mel_spectrogram.to('cuda')

# Decode the Mel-spectrogram to an audio waveform
audio_waveform = vocos_model.decode(mel_spectrogram)

print(f"Generated audio waveform shape: {audio_waveform.shape}")
print(f"Generated audio on device: {audio_waveform.device}")
# To save the audio, you would typically use torchaudio.save:
# import torchaudio
# torchaudio.save("generated_audio.wav", audio_waveform.cpu(), sample_rate=24000)
