Torchcrepe
Torchcrepe is a PyTorch implementation of the CREPE pitch tracker, a state-of-the-art monophonic pitch estimator built on a deep convolutional neural network. It computes pitch and periodicity from audio signals and provides utilities for direct file processing, filtering, thresholding, and a choice of decoding strategies. The library is distributed as a PyPI package.
Warnings
- gotcha Torchcrepe's default Viterbi decoding differs from the original CREPE (TensorFlow) implementation. It uses Viterbi decoding on the softmax output instead of a weighted average, which helps prevent double/half frequency errors but changes the default pitch estimation approach.
- gotcha CREPE models were not trained on silent audio. This can lead to the model assigning high confidence to pitch bins even in silent regions. You may observe spurious pitch predictions in quiet sections.
- gotcha The `batch_size` argument in `torchcrepe.predict` controls internal batching over the frames of a single audio signal; it is not a mechanism for processing multiple distinct audio files in one call. Batching files of different lengths is not straightforward, and padding overhead can erase the expected speedup.
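To see why Viterbi decoding helps with double/half frequency errors, here is a toy, pure-NumPy sketch, not torchcrepe's actual implementation: per-frame argmax follows a spurious peak, while a Viterbi path whose transition cost penalizes large bin jumps stays on the stable pitch track. The posteriorgram values and `switch_penalty` are made-up illustration numbers.

import numpy as np

def argmax_decode(posterior):
    # Independent per-frame argmax: each frame picks its own best bin.
    return posterior.argmax(axis=1)

def viterbi_decode(posterior, switch_penalty=0.5):
    # Minimal Viterbi sketch: the transition cost grows with the jump
    # size between consecutive bins, discouraging octave-style jumps.
    n_frames, n_bins = posterior.shape
    log_post = np.log(posterior + 1e-12)
    bins = np.arange(n_bins)
    transition = -switch_penalty * np.abs(bins[:, None] - bins[None, :])
    score = log_post[0].copy()
    back = np.zeros((n_frames, n_bins), dtype=int)
    for t in range(1, n_frames):
        total = score[:, None] + transition  # (from_bin, to_bin)
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0) + log_post[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy posteriorgram: bin 1 is the true pitch, but the middle frame has a
# slightly stronger spurious peak at bin 3.
posterior = np.array([
    [0.05, 0.80, 0.10, 0.05],
    [0.05, 0.40, 0.10, 0.45],
    [0.05, 0.80, 0.10, 0.05],
])
print(argmax_decode(posterior))   # jumps to the spurious bin: [1 3 1]
print(viterbi_decode(posterior))  # stays on the stable track: [1 1 1]

Note that torchcrepe applies Viterbi to the network's softmax output by default, so results will differ from the original CREPE's weighted-average decoding even on the same audio.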
Install
-
pip install torchcrepe
Imports
- torchcrepe
import torchcrepe
- predict
from torchcrepe import predict
- load.audio
from torchcrepe.load import audio
Quickstart
import torch
import torchcrepe
import numpy as np

# Mock torchcrepe.load.audio for a runnable example without external files
class MockLoadAudio:
    def audio(self, *args, **kwargs):
        # Generate a dummy 16kHz sine wave audio (1 second)
        sr = 16000
        duration = 1.0
        frequency = 440.0  # Hz
        t = np.linspace(0., duration, int(sr * duration), endpoint=False)
        audio_np = 0.5 * np.sin(2 * np.pi * frequency * t).astype(np.float32)
        return torch.from_numpy(audio_np).unsqueeze(0), sr  # unsqueeze for batch dimension

torchcrepe.load = MockLoadAudio()

# Load dummy audio
audio, sr = torchcrepe.load.audio('dummy.wav', sr=16000)

# Here we'll use a 5 millisecond hop length
hop_length = int(sr / 200.)

# Provide a sensible frequency range for your domain (upper limit is 2006 Hz)
# This would be a reasonable range for speech
fmin = 50
fmax = 550

# Select a model capacity--one of "tiny" or "full"
model = 'tiny'

# Choose a device to use for inference
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Pick a batch size that doesn't cause memory errors on your gpu
batch_size = 2048  # Note: batching here is over audio frames, not input files

# Compute pitch
pitch = torchcrepe.predict(
    audio,
    sr,
    hop_length,
    fmin,
    fmax,
    model,
    batch_size=batch_size,
    device=device,
    return_periodicity=False  # Set to True to also get a confidence score
)

print(f"Predicted pitch shape: {pitch.shape}")
if pitch.shape[-1] > 0:
    print(f"First few pitch values: {pitch[0, :5].tolist()}")
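Because silent regions can still receive confident pitch bins (see Warnings), torchcrepe's documentation recommends masking pitch using the periodicity output returned when `return_periodicity=True`, via helpers such as `torchcrepe.threshold.At(.21)` and `torchcrepe.filter.median`. The snippet below is a minimal NumPy sketch of that masking idea, not the library's own code; the 0.21 threshold follows the README's example value, and `threshold_at` is a hypothetical helper name.

import numpy as np

def threshold_at(pitch, periodicity, value=0.21):
    # Replace pitch with NaN wherever periodicity falls below the
    # threshold, mirroring the idea behind torchcrepe.threshold.At(.21).
    out = pitch.astype(float).copy()
    out[periodicity < value] = np.nan
    return out

pitch = np.array([440.0, 441.0, 523.0, 439.0])
periodicity = np.array([0.9, 0.85, 0.05, 0.92])  # frame 2 is unvoiced/silent
print(threshold_at(pitch, periodicity))  # frame 2's pitch becomes nan

In torchcrepe itself you would pass the real `pitch` and `periodicity` tensors from `torchcrepe.predict(..., return_periodicity=True)` to the library's threshold and filter utilities instead.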