pyvad
Pyvad is a Python wrapper for the `py-webrtcvad` library, designed for trimming speech clips from audio. It provides a simplified interface for Voice Activity Detection (VAD) functionality, allowing users to identify and extract voiced segments from audio data. The current version is 0.2.0, released in July 2022, with an infrequent release cadence.
Common errors
- ValueError: fs_vad must be 8000, 16000, 32000 or 48000.
  - cause: The `fs_vad` parameter passed to `vad` or `trim` is not one of the sampling frequencies supported by the WebRTC VAD.
  - fix: Set `fs_vad` to 8000, 16000, 32000, or 48000, e.g. `vad(data, fs, fs_vad=16000)`. If `fs` differs from `fs_vad`, resample your audio `data` first, e.g. with `librosa.resample`.
- ValueError: hop_length must be 10, 20, or 30.
  - cause: The `hop_length` parameter (frame duration in milliseconds) is not one of the allowed values.
  - fix: Set `hop_length` to 10, 20, or 30, e.g. `vad(data, fs, hop_length=30)`.
- ValueError: When data.type is float, data must be -1.0 <= data <= 1.0. (or similar for int data)
  - cause: The input audio `data` is not scaled correctly for its data type: `float` data must lie within [-1.0, 1.0], and `int` data must fit the 16-bit PCM range (-32768 to 32767).
  - fix: Normalize `float` data, e.g. `audio_data = audio_data / np.max(np.abs(audio_data))`, or ensure `int` data is correctly scaled.
- ImportError: cannot import name 'vad' from 'pyvad' (.../pyvad/__init__.py)
  - cause: pyvad or one of its dependencies (`py-webrtcvad`) failed to install correctly, or the Python version is incompatible; pyvad 0.2.0 supports only Python 3.8 and 3.9.
  - fix: Ensure your Python version is 3.8 or 3.9, then reinstall with `pip install --upgrade --no-cache-dir pyvad py-webrtcvad numpy librosa`. Check the installation logs for compilation errors, particularly for `py-webrtcvad`.
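As a sketch of the normalization fix above, here is one common conversion from 16-bit PCM integers (as read from a WAV file) to floats in [-1.0, 1.0]; the array name `pcm_int16` is only for this example and the snippet uses plain NumPy, not pyvad:

```python
import numpy as np

# Hypothetical 16-bit PCM samples, e.g. as read from a WAV file.
pcm_int16 = np.array([0, 16384, -32768, 32767], dtype=np.int16)

# Divide by the int16 full-scale value to map into [-1.0, 1.0].
audio_float = pcm_int16.astype(np.float32) / 32768.0

# audio_float now satisfies pyvad's float-input requirement.
print(audio_float.min(), audio_float.max())
```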
Warnings
- breaking Pyvad v0.2.0 introduced significant breaking changes, specifically restricting Python version compatibility to 3.8 and 3.9 only. Earlier versions might support Python 3.6+ or even Python 2.x (for pre-0.1.0 versions). Additionally, the `hoplength` argument was renamed to `hop_length`, and the `trim` function's `return_sec` argument was removed, with `trim` now returning `(start_index, end_index)` directly.
- gotcha The underlying `webrtcvad` library, and by extension `pyvad`, has strict requirements for audio input parameters. The `fs_vad` (internal sampling frequency for VAD) must be 8000, 16000, 32000, or 48000 Hz, and `hop_length` (frame duration) must be 10, 20, or 30 milliseconds. Input `data` must be mono and scaled correctly: if `int`, between -32768 and 32767; if `float`, between -1.0 and 1.0. Failure to meet these requirements will result in `ValueError` exceptions.
- gotcha The `vad_mode` parameter, controlling aggressiveness, must be an integer between 0 and 3. A higher value (e.g., 3) makes the VAD more aggressive in filtering out non-speech, while a lower value (e.g., 0) is less aggressive. Using a value outside this range will raise a `ValueError`.
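Since all three constraints raise `ValueError` only at call time, it can be convenient to validate parameters up front. A minimal sketch (the helper `check_vad_params` is hypothetical, not part of pyvad; it mirrors the constraints listed above):

```python
def check_vad_params(fs_vad, hop_length, vad_mode):
    """Raise ValueError early if parameters violate the WebRTC VAD constraints."""
    if fs_vad not in (8000, 16000, 32000, 48000):
        raise ValueError("fs_vad must be 8000, 16000, 32000 or 48000.")
    if hop_length not in (10, 20, 30):
        raise ValueError("hop_length must be 10, 20, or 30.")
    if vad_mode not in (0, 1, 2, 3):
        raise ValueError("vad_mode must be an integer between 0 and 3.")

check_vad_params(fs_vad=16000, hop_length=30, vad_mode=2)  # passes silently
```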
Install
-
pip install pyvad
Imports
- vad
from pyvad import vad
- trim
from pyvad import trim
Quickstart
import numpy as np
from pyvad import vad, trim
# Simulate audio data (e.g., 1 second of speech, 1 second of silence)
fs = 16000 # Sample rate in Hz (WebRTC VAD supported rate)
duration_speech = 1.0 # seconds
duration_silence = 1.0 # seconds
# Generate a simple sine wave for 'speech'
t = np.linspace(0, duration_speech, int(fs * duration_speech), endpoint=False)
speech_data = 0.5 * np.sin(2 * np.pi * 440 * t) # 440 Hz sine wave
# Generate silence
silence_data = np.zeros(int(fs * duration_silence))
# Combine speech and silence
audio_data = np.concatenate((silence_data, speech_data, silence_data)).astype(np.float32)
print(f"Audio data shape: {audio_data.shape}, Sample rate: {fs} Hz")
# Perform Voice Activity Detection
vact = vad(audio_data, fs)
print(f"Voice activity array shape: {vact.shape}")
# vact will contain 1s for voiced segments, 0s for unvoiced
# Trim silence from the audio
start_idx, end_idx = trim(audio_data, fs)  # v0.2.0: trim returns (start_index, end_index)
trimmed_audio = audio_data[start_idx:end_idx]
print(f"Trimmed audio shape: {trimmed_audio.shape}")
print(f"Original audio length: {len(audio_data) / fs:.2f}s")
print(f"Trimmed audio from {start_idx/fs:.2f}s to {end_idx/fs:.2f}s, total {len(trimmed_audio)/fs:.2f}s")
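The activity array can also be turned into a list of voiced segments. A sketch assuming `vact` is a 1-D array of 0s and 1s as described above; the helper `activity_to_segments` is illustrative, not part of pyvad:

```python
import numpy as np

def activity_to_segments(vact):
    """Convert a 0/1 voice-activity array into (start, end) sample-index pairs."""
    # Pad with zeros so segments touching either edge still produce transitions.
    padded = np.concatenate(([0], np.asarray(vact).astype(int), [0]))
    edges = np.diff(padded)
    starts = np.where(edges == 1)[0]   # 0 -> 1 transitions
    ends = np.where(edges == -1)[0]    # 1 -> 0 transitions
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Example with a synthetic activity array:
demo = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0])
print(activity_to_segments(demo))  # [(2, 5), (6, 8)]
```

Each `(start, end)` pair is a half-open sample range, so `audio_data[start:end]` extracts one voiced chunk.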