nnAudio
nnAudio is a GPU-accelerated audio processing toolbox built on PyTorch's 1D convolutional neural networks. It specializes in generating various spectrograms (STFT, Mel, CQT) on-the-fly during deep learning training, with Fourier kernels that are differentiable and can optionally be made trainable. Because the transforms run as GPU convolutions, spectrogram computation is significantly faster than with traditional CPU-based libraries. The library is currently at version 0.3.4.
Warnings
- breaking The `device` argument for initializing spectrogram layers (e.g., `STFT(device='cuda')`) was removed in version 0.2.0. Layers must now be moved to the desired device using the PyTorch standard `.to(device)` method after initialization.
- deprecated The `nnAudio.Spectrogram` module path is being replaced by `nnAudio.features`. While `nnAudio.Spectrogram` might still function, `nnAudio.features` is the recommended and future-proof import path for all spectrogram classes.
- gotcha For full functionality, including the Griffin-Lim inverse transform, PyTorch version 1.6.0 or higher is required. Using older PyTorch versions might limit certain features.
- gotcha While `librosa` is a common audio library, `nnAudio` is designed to function without it as a strict dependency. Necessary mel filter functions are included internally to prevent forced `librosa` installation issues.
Install
- pip
pip install nnAudio
- from source
pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
Imports
- features
from nnAudio import features
- STFT
from nnAudio.features import STFT
- MelSpectrogram
from nnAudio.features.mel import MelSpectrogram
Quickstart
import torch
import numpy as np
from nnAudio import features
# Simulate an audio waveform (e.g., from a .wav file)
sr = 16000 # Sample rate
duration = 1 # seconds
t = np.linspace(0, duration, int(sr * duration), endpoint=False)
# Simple sine wave at 440 Hz
song = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
# nnAudio expects a batch dimension, so unsqueeze(0)
x = torch.tensor(song).unsqueeze(0)
# Move to GPU if available, otherwise CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)
# Initialize an STFT spectrogram layer
# Pass sample rate (sr) to the layer
spec_layer = features.STFT(n_fft=2048, hop_length=512, sr=sr).to(device)
# Feed-forward your waveform to get the spectrogram
spectrogram = spec_layer(x)
print(f"Input waveform shape: {x.shape}")
print(f"Output spectrogram shape: {spectrogram.shape}")
# With the default trainable=False, the Fourier kernels are registered as
# buffers, not parameters, so check the device via buffers()
print(f"Spectrogram layer on device: {next(spec_layer.buffers()).device}")