Descript Audio Codec
Descript Audio Codec (DAC) is a high-fidelity, general-purpose neural audio codec, currently at version 1.0.0. It compresses audio (at 16 kHz, 24 kHz, or 44.1 kHz sampling rates) into discrete codes at very low bitrates, achieving roughly 90x compression for 44.1 kHz audio at 8 kbps while maintaining high fidelity. It is designed to be universal, working across audio domains including speech, music, and environmental sounds, and can serve as a drop-in replacement for codecs such as EnCodec in audio language modeling applications.
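As a sanity check on the quoted compression figure, a back-of-the-envelope calculation (assuming 16-bit mono PCM as the uncompressed baseline, which is an assumption, not something the library states):

```python
# Uncompressed 44.1 kHz, 16-bit mono PCM bitrate, in kbps.
pcm_kbps = 44100 * 16 / 1000  # 705.6 kbps

# DAC's discrete codes for the 44.1 kHz model run at about 8 kbps.
dac_kbps = 8

compression_ratio = pcm_kbps / dac_kbps
print(round(compression_ratio))  # 88, in line with the quoted ~90x
```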
Warnings
- gotcha Model limitations exist for certain audio types; it performs best on speech and may have difficulty perfectly reconstructing some musical instruments (e.g., glockenspiel) or complex environmental sounds.
- gotcha Model weights are downloaded from the internet on the first call to `dac.utils.download()` and cached locally; `dac.DAC.load()` then loads the model from the returned path. Initial setup therefore requires an active internet connection.
- gotcha Changes in early `0.x` versions (e.g., between 0.0.3 and 0.0.4) modified the storage format of discrete codes to `uint16`. If you saved codes with very old `0.x` versions, they might not be compatible with newer versions of the library.
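Since each DAC codebook holds 1024 entries, every code index fits in 16 bits, which is why newer releases store codes as `uint16`. A minimal sketch of compact storage and restoration (the `codes` tensor here is a stand-in with the shape a real model produces, not output from an actual encode):

```python
import numpy as np
import torch

# Stand-in codes tensor: (batch, num_quantizers, sequence_length),
# values are codebook indices in [0, 1024), so they fit in 16 bits.
codes = torch.randint(0, 1024, (1, 9, 500))

# Store compactly as uint16 (half the size of int32, a quarter of int64).
packed = codes.numpy().astype(np.uint16)
np.save("codes.npy", packed)

# Restore to a long tensor before feeding back to the model.
restored = torch.from_numpy(np.load("codes.npy").astype(np.int64))
assert torch.equal(restored, codes)
```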
Install
-
pip install descript-audio-codec
Imports
- DAC
import dac
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
- AudioSignal
from audiotools import AudioSignal
Quickstart
import dac
import torch
from audiotools import AudioSignal
# Download the pretrained weights (cached after the first call) and load the model.
# model_type can be "16khz", "24khz", or "44khz" (default: "44khz").
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path).to(device)
# Prepare dummy audio for encoding (e.g., 5 seconds of mono 44.1 kHz noise).
# DAC models are mono; encode multichannel audio one channel at a time.
sample_rate = model.sample_rate  # 44100
duration = 5  # seconds
audio_tensor = torch.randn(1, 1, sample_rate * duration, device=device)
signal = AudioSignal(audio_tensor, sample_rate)
# Encode the audio to discrete codes, then decode back to a waveform.
with torch.no_grad():
    # Pad the input to what the model expects.
    x = model.preprocess(signal.audio_data, signal.sample_rate)
    # z: quantized continuous latents
    # codes: discrete codebook indices (batch, num_quantizers, sequence_length)
    # latents: continuous projected latents before quantization
    z, codes, latents, _, _ = model.encode(x)
    # Reconstruct audio from the quantized latents.
    y = model.decode(z)
print(f"Original audio shape: {signal.audio_data.shape}")
print(f"Encoded codes shape (batch, num_quantizers, sequence_length): {codes.shape}")
print(f"Decoded audio shape: {y.shape}")
# In a real application, you might save `codes` for storage/transmission,
# e.g. torch.save(codes, "compressed_audio_codes.pt"), or write the
# reconstructed audio to disk:
# AudioSignal(y.cpu(), sample_rate).write("reconstructed_audio.wav")
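For whole files, the library also exposes a higher-level path via `model.compress()` / `model.decompress()` and the `dac.DACFile` container, which handle chunking and metadata for you. A sketch, assuming an `input.wav` exists on disk:

```python
import dac
from audiotools import AudioSignal

model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)

signal = AudioSignal("input.wav")

# One-shot compression: returns a DACFile holding the discrete codes
# plus the metadata needed to reconstruct the original audio.
compressed = model.compress(signal)
compressed.save("input.dac")

# Later (or on another machine): load the codes and decode to audio.
compressed = dac.DACFile.load("input.dac")
reconstructed = model.decompress(compressed)
reconstructed.write("reconstructed.wav")
```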