WebRTC Voice Activity Detector
webrtcvad is a Python interface to the WebRTC Voice Activity Detector (VAD) from Google. It classifies short segments of audio as voiced or unvoiced, which is useful in telephony and speech recognition. The current version is 2.0.10; releases have historically followed an infrequent, as-needed cadence, incorporating upstream WebRTC VAD changes or bug fixes.
Warnings
- gotcha The WebRTC VAD has strict audio input requirements: 16-bit, mono PCM audio, sampled at 8000, 16000, 32000, or 48000 Hz. Frames must be exactly 10, 20, or 30 ms in duration.
- gotcha The `webrtcvad` package (wiseman/py-webrtcvad) can be difficult to install on some platforms due to its C/C++ dependencies and lack of pre-built wheels for all Python versions/OS combinations.
- gotcha Version 2.0.10 fixed a memory leak in the `is_speech()` method. Later versions of the `webrtcvad-wheels` fork (e.g., 2.0.13) address further memory leak issues.
- gotcha The WebRTC VAD is a simple, real-time oriented model and may produce false positives for non-speech sounds (e.g., music, birdsong) or false negatives in very noisy environments, even at high aggressiveness settings.
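The input constraints above are strict, and the VAD raises an error on frames that violate them, so it can help to validate frames before calling the library. A minimal sketch using only the standard library (the helper name `valid_frame` is ours, not part of webrtcvad):

```python
def valid_frame(frame: bytes, sample_rate: int, frame_duration_ms: int) -> bool:
    """Check a frame against the WebRTC VAD input requirements:
    16-bit mono PCM at 8000/16000/32000/48000 Hz, in 10/20/30 ms frames."""
    if sample_rate not in (8000, 16000, 32000, 48000):
        return False
    if frame_duration_ms not in (10, 20, 30):
        return False
    bytes_per_sample = 2  # 16-bit PCM
    expected = sample_rate * frame_duration_ms // 1000 * bytes_per_sample
    return len(frame) == expected

# 30 ms at 16000 Hz is 480 samples, i.e. 960 bytes of 16-bit mono PCM
print(valid_frame(b'\x00' * 960, 16000, 30))  # True
print(valid_frame(b'\x00' * 100, 16000, 30))  # False: wrong length
```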
Install
pip install webrtcvad
# If no pre-built wheel exists for your platform, the webrtcvad-wheels fork publishes binary wheels:
pip install webrtcvad-wheels
Imports
- Vad
import webrtcvad
vad = webrtcvad.Vad()
Quickstart
import webrtcvad
import struct
# WebRTC VAD requires 16-bit mono PCM audio at specific sample rates
# and frame durations (10, 20, or 30 ms).
sample_rate = 16000 # Hz
frame_duration_ms = 30 # ms
bytes_per_sample = 2 # 16-bit audio
# Calculate frame size in bytes
frame_size_bytes = int(sample_rate * (frame_duration_ms / 1000.0) * bytes_per_sample)
# Create a VAD instance with an aggressiveness mode (0-3)
# 0: least aggressive, 3: most aggressive
vad = webrtcvad.Vad(3)
# Create a silent audio frame (16-bit mono PCM)
silence_frame = b'\x00\x00' * int(frame_size_bytes / bytes_per_sample)
# Create a mock speech-like frame (a square wave for demonstration).
# In a real application, this would come from an audio input.
speech_frame = b''
for i in range(int(frame_size_bytes / bytes_per_sample)):
    # Alternate the sign every 15 samples to approximate a square wave
    amplitude = 10000  # max 32767 for 16-bit audio
    value = amplitude if i % 30 < 15 else -amplitude
    speech_frame += struct.pack('<h', value)
print(f"Processing frame of {frame_duration_ms} ms at {sample_rate} Hz")
# Test with silence
is_speech_silence = vad.is_speech(silence_frame, sample_rate)
print(f"Silence frame contains speech: {is_speech_silence}")
# Test with speech-like audio
is_speech_mock = vad.is_speech(speech_frame, sample_rate)
print(f"Mock speech frame contains speech: {is_speech_mock}")
# You can also set the mode after initialization
vad.set_mode(1)
print("VAD aggressiveness set to 1.")
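In practice, audio arrives as a longer PCM buffer that must be split into VAD-sized frames before each call to `is_speech()`. A sketch of that splitting step, using only the standard library (the helper name `frame_generator` is ours, not part of webrtcvad):

```python
def frame_generator(pcm: bytes, sample_rate: int, frame_duration_ms: int = 30):
    """Yield successive fixed-size frames from a 16-bit mono PCM buffer.
    A trailing partial frame is dropped, since the VAD rejects it."""
    frame_bytes = sample_rate * frame_duration_ms // 1000 * 2  # 2 bytes per sample
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[offset:offset + frame_bytes]

# One second of silence at 16000 Hz: 32000 bytes -> 33 complete 30 ms frames
pcm = b'\x00' * (16000 * 2)
frames = list(frame_generator(pcm, 16000, 30))
print(len(frames))  # 33
```

Each yielded frame can then be passed to `vad.is_speech(frame, sample_rate)`, accumulating the per-frame decisions into voiced/unvoiced segments.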