High Fidelity Neural Audio Codec
EnCodec is a Python library from Meta AI (formerly Facebook AI) that provides a state-of-the-art, deep-learning-based audio codec. It supports mono 24 kHz and stereo 48 kHz audio at a range of compression rates, using a streaming encoder-decoder architecture with a quantized latent space and adversarial losses for high-fidelity reconstruction. The current stable version is 0.1.1; development continues on GitHub, and the models are also integrated into Hugging Face Transformers.
Warnings
- gotcha The original `encodec` library does not handle very long audio files gracefully. It processes the entire file at once, which can lead to high memory consumption and Out-of-Memory (OOM) errors. The developers have stated they do not currently support this use case.
- breaking To use `encodec` via Hugging Face Transformers (as often recommended), you may need to install `transformers` from its `main` branch on GitHub rather than the stable PyPI release, because the EnCodec integration can be newer than the latest `transformers` release.
- gotcha The 48 kHz stereo Encodec model processes audio in 1-second chunks with a 1% overlap and renormalizes the audio to unit scale. When extracting discrete representations, `model.encode(wav)` will return a list of `(codes, scale)` tuples, one for each 1-second frame. This behavior differs from the 24 kHz model.
- gotcha Ensure a reasonably recent version of PyTorch (ideally 1.11.0 or newer) is installed. Older PyTorch versions (e.g., <1.8) may have compatibility issues, such as different default values for `torch.stft(return_complex)` within `encodec`'s internal audio processing.
Install
-
pip install -U encodec
Imports
- EncodecModel
from encodec import EncodecModel
- EncodecModel (via transformers)
from transformers import EncodecModel, AutoProcessor
Quickstart
import torch
from datasets import load_dataset
from transformers import EncodecModel, AutoProcessor
# NOTE: For a real application, you would load your own audio file.
# For quickstart, using a dummy dataset from Hugging Face.
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample_audio = librispeech_dummy[0]["audio"]["array"]
sample_rate = librispeech_dummy[0]["audio"]["sampling_rate"]
# Load the pre-trained 24 kHz mono EnCodec model and its processor
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
# Pre-process the audio
inputs = processor(
raw_audio=sample_audio,
sampling_rate=sample_rate,
return_tensors="pt"
)
# Encode the audio. You can pass a bandwidth (1.5, 3.0, 6.0, 12.0, or 24.0 kbps),
# e.g. model.encode(..., bandwidth=3.0); the default is the lowest, 1.5 kbps.
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
# Decode back to a waveform. decode() needs the discrete codes, the per-chunk
# scales, and the padding mask; [0] extracts the audio tensor from the output.
audio_values = model.decode(
    encoder_outputs.audio_codes,
    encoder_outputs.audio_scales,
    inputs["padding_mask"],
)[0]
print(f"Original audio shape: {inputs['input_values'].shape}")
print(f"Decoded audio shape: {audio_values.shape}")
print("Audio encoded and decoded successfully!")
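As a rule of thumb when choosing a bandwidth: the 24 kHz encoder emits 75 latent frames per second and each residual codebook index costs 10 bits (1024-entry codebooks), so every codebook contributes 0.75 kbps. A back-of-envelope sketch (`codebooks_for` is a hypothetical helper for illustration, not a library API):

```python
# Assumed constants for the 24 kHz model: 75 frames/s, 10 bits per codebook
# entry (1024-entry codebooks) -> 0.75 kbps per residual codebook.
FRAME_RATE = 75
BITS_PER_CODEBOOK = 10

def codebooks_for(bandwidth_kbps: float) -> int:
    """Number of residual codebooks active at a given target bandwidth."""
    return int(bandwidth_kbps * 1000 / (FRAME_RATE * BITS_PER_CODEBOOK))

for bw in [1.5, 3.0, 6.0, 12.0, 24.0]:
    print(f"{bw:>4} kbps -> {codebooks_for(bw)} codebooks")
```

So 1.5 kbps uses only 2 codebooks while 24 kbps uses 32, which is the main quality/size trade-off the `bandwidth` argument controls.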