WebRTC Voice Activity Detector

2.0.10 · active · verified Mon Apr 13

webrtcvad is a Python interface to the Google WebRTC Voice Activity Detector (VAD). It classifies short segments of audio as voiced or unvoiced, which is useful for telephony and speech recognition. The current version is 2.0.10; releases historically follow an infrequent, as-needed cadence to incorporate upstream WebRTC VAD changes or bug fixes.

Warnings

Install

pip install webrtcvad

Imports

import webrtcvad

Quickstart

This quickstart initializes the VAD, sets its aggressiveness mode, and classifies both a silent frame and a mock speech frame. It also highlights the strict audio-format requirements: 16-bit mono PCM at 8000, 16000, 32000, or 48000 Hz, in frames of 10, 20, or 30 ms.

import webrtcvad
import struct

# WebRTC VAD requires 16-bit mono PCM audio at specific sample rates
# and frame durations (10, 20, or 30 ms).
sample_rate = 16000 # Hz
frame_duration_ms = 30 # ms
bytes_per_sample = 2 # 16-bit audio

# Calculate frame size in bytes
frame_size_bytes = int(sample_rate * (frame_duration_ms / 1000.0) * bytes_per_sample)

# Create a VAD instance with an aggressiveness mode (0-3)
# 0: least aggressive, 3: most aggressive
vad = webrtcvad.Vad(3)

# Create a silent audio frame (16-bit mono PCM)
silence_frame = b'\x00\x00' * (frame_size_bytes // bytes_per_sample)

# Create a mock speech-like frame (a simple square wave for demonstration).
# In a real application, this would come from an audio input.
speech_frame = b''
for i in range(frame_size_bytes // bytes_per_sample):
    # Square wave: alternate between +amplitude and -amplitude every 15 samples
    amplitude = 10000  # max is 32767 for 16-bit audio
    value = amplitude if i % 30 < 15 else -amplitude
    speech_frame += struct.pack('<h', value)


print(f"Processing frame of {frame_duration_ms} ms at {sample_rate} Hz")

# Test with silence
is_speech_silence = vad.is_speech(silence_frame, sample_rate)
print(f"Silence frame contains speech: {is_speech_silence}")

# Test with speech-like audio
is_speech_mock = vad.is_speech(speech_frame, sample_rate)
print(f"Mock speech frame contains speech: {is_speech_mock}")

# You can also set the mode after initialization
vad.set_mode(1)
print("VAD aggressiveness set to 1.")
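Building on the quickstart: real audio rarely arrives in exact 10/20/30 ms chunks, so a common next step is splitting a longer PCM buffer into fixed-size frames before passing each one to is_speech. The frame_generator helper below is an illustrative sketch, not part of the webrtcvad API; trailing bytes that do not fill a whole frame are dropped, since the VAD only accepts complete frames.

```python
def frame_generator(pcm_bytes, sample_rate, frame_duration_ms, bytes_per_sample=2):
    """Yield successive frames of frame_duration_ms from a 16-bit mono PCM buffer.

    Incomplete trailing bytes are discarded, because the VAD rejects
    frames that are not exactly 10, 20, or 30 ms long.
    """
    frame_size = int(sample_rate * frame_duration_ms / 1000) * bytes_per_sample
    for offset in range(0, len(pcm_bytes) - frame_size + 1, frame_size):
        yield pcm_bytes[offset:offset + frame_size]

# One second of silence at 16 kHz is 32000 bytes (16000 samples * 2 bytes),
# which yields 33 complete 30 ms frames of 960 bytes each (32000 // 960 = 33).
frames = list(frame_generator(b'\x00\x00' * 16000, 16000, 30))
print(len(frames))     # 33
print(len(frames[0]))  # 960
```

Each yielded frame can then be passed directly to vad.is_speech(frame, sample_rate).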
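Per-frame decisions from the VAD can be noisy, flickering between voiced and unvoiced on individual frames. A common smoothing pattern (not part of the library; the smooth_decisions name and thresholds here are illustrative) is a sliding majority vote: only declare speech once most frames in a recent window are voiced, and only declare silence once most are unvoiced.

```python
from collections import deque

def smooth_decisions(decisions, window=10, threshold=0.8):
    """Smooth a sequence of per-frame booleans with a sliding majority vote.

    Triggers "speech" when more than threshold * window of the recent
    frames are voiced, and untriggers when more than threshold * window
    are unvoiced. Returns one smoothed flag per input frame.
    """
    ring = deque(maxlen=window)
    triggered = False
    out = []
    for d in decisions:
        ring.append(d)
        voiced = sum(ring)
        if not triggered and voiced > threshold * window:
            triggered = True
        elif triggered and (len(ring) - voiced) > threshold * window:
            triggered = False
        out.append(triggered)
    return out

# A burst of 10 voiced frames followed by 12 unvoiced frames: the smoothed
# output ignores the first few frames, then stays "on" until silence dominates.
decisions = [True] * 10 + [False] * 12
flags = smooth_decisions(decisions)
print(sum(flags))  # 10
```

The hysteresis (separate enter/exit conditions) prevents rapid toggling around the decision boundary, at the cost of a small latency of up to one window.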
