Silero Voice Activity Detector (VAD)
Silero VAD is a state-of-the-art Voice Activity Detector provided by Silero and built with PyTorch. It identifies speech segments within audio, offering high quality and performance across many languages and noisy environments. The current version is 6.2.1, and the library maintains an active release cadence, with regular updates to models and features.
Warnings
- breaking As of v6.2.1, `onnxruntime` is no longer a required dependency for `silero-vad`. If you plan to use the ONNX version of the models, you must explicitly install `onnxruntime` (or `onnxruntime-gpu`) yourself. Failing to do so will result in errors if `onnx=True` is passed to `torch.hub.load`.
- breaking Version 6.0 introduced a 'New v6 VAD' model with improved quality and a changed training algorithm. While generally better, this might mean that existing applications tuned for older models (v5, v4) could exhibit different behavior, require re-tuning parameters, or see changes in speech detection sensitivity.
- breaking Version 5.0 introduced significant changes, including a 3x faster inference, a 2x larger model size, and vastly improved quality supporting over 6000 languages. Applications relying on previous model versions (v4) for specific performance characteristics or model size might need to update their pipelines or resource estimations.
- gotcha The Silero VAD model and its core utilities (`get_speech_timestamps`, `VADIterator`, etc.) are primarily loaded using `torch.hub.load` directly from the `snakers4/silero-vad` GitHub repository. Attempting to import these functions directly from the installed `silero_vad` Python package (e.g., `from silero_vad.utils import get_speech_timestamps`) will likely fail or lead to unexpected behavior, as the package serves primarily as an installer/wrapper.
- gotcha The VAD models expect audio to be at a specific sampling rate (most commonly 16kHz, though some older models supported 8kHz, and v4 supports both 8k/16k for ONNX). Providing audio with a mismatched sampling rate will lead to incorrect or degraded VAD performance without explicit errors.
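Because a mismatched sampling rate degrades VAD output silently, it is worth resampling to 16 kHz up front. The sketch below is a minimal stand-in using only `torch` (naive linear interpolation); in practice `torchaudio.functional.resample` gives better quality, and the 44.1 kHz input here is a hypothetical example.

```python
import torch

# Hedged sketch: naive linear resampling to the VAD's expected rate using
# torch only; prefer torchaudio.functional.resample for real audio
def resample_linear(wave: torch.Tensor, orig_sr: int, target_sr: int = 16000) -> torch.Tensor:
    n_out = int(wave.shape[-1] * target_sr / orig_sr)
    # interpolate expects a (batch, channels, length) tensor
    return torch.nn.functional.interpolate(
        wave.view(1, 1, -1), size=n_out, mode="linear", align_corners=False
    ).view(-1)

audio_44k = torch.randn(44100)            # 1 second of audio at 44.1 kHz
audio_16k = resample_linear(audio_44k, 44100)
print(audio_16k.shape[0])
```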
Install
- silero-vad
pip install silero-vad
- onnxruntime (optional, CPU ONNX backend)
pip install onnxruntime
- onnxruntime-gpu (optional, GPU ONNX backend)
pip install onnxruntime-gpu
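Since v6.2.1 no longer pulls in `onnxruntime` automatically, it can help to verify the backend is importable before requesting the ONNX model. A small stdlib-only check:

```python
import importlib.util

# onnxruntime is optional as of v6.2.1; confirm it is installed before
# passing onnx=True to torch.hub.load, or loading will fail
onnx_available = importlib.util.find_spec("onnxruntime") is not None
print(f"ONNX runtime available: {onnx_available}")
```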
Imports
- torch
import torch
- torchaudio
import torchaudio
- silero_vad model and utils
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', force_reload=True)
- get_speech_timestamps, VADIterator and other helpers (unpacked from the utils tuple returned by torch.hub.load; they are not importable as a module)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
Quickstart
import torch
import torchaudio

# The model runs fine on CPU; CUDA is optional
if not torch.cuda.is_available():
    print("Warning: CUDA not available, using CPU for VAD.")
# Load the Silero VAD model and utilities from torch hub
# force_reload=True ensures you get the latest version from the repo
# onnx=True if you have onnxruntime installed and want to use ONNX model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=True,
    onnx=False  # Set to True if onnxruntime is installed and preferred
)
# Destructure utilities
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
# Define the sampling rate required by the model (e.g., 16000 Hz)
SAMPLING_RATE = 16000
# Create dummy audio for demonstration (10 seconds, 16kHz)
samples = SAMPLING_RATE * 10
dummy_audio = torch.randn(samples, dtype=torch.float32)
# The dummy audio above is already generated at SAMPLING_RATE, so no
# resampling is needed here. For real audio, load and resample like this:
# audio, sr = torchaudio.load('your_audio.wav')
# if sr != SAMPLING_RATE:
#     audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=SAMPLING_RATE)
# Process the audio to get speech timestamps
speech_timestamps = get_speech_timestamps(dummy_audio, model, sampling_rate=SAMPLING_RATE)
print(f"Speech timestamps detected: {speech_timestamps}")
# Example of using VADIterator for streaming processing (fixed-size chunks)
vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)
# Recent models (v5+) expect exactly 512 samples per chunk at 16 kHz
# (256 samples at 8 kHz); other chunk sizes raise an error
chunk_size = 512
for i in range(0, dummy_audio.shape[0], chunk_size):
    chunk = dummy_audio[i:i + chunk_size]
    if chunk.shape[0] < chunk_size:  # Skip the final partial chunk
        continue
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(f"Speech event at {i / SAMPLING_RATE:.2f}s: {speech_dict}")
vad_iterator.reset_states()  # Reset internal states when the stream ends
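The `collect_chunks` utility unpacked above concatenates the regions returned by `get_speech_timestamps` (dicts with `start`/`end` in samples) into a speech-only tensor, which `save_audio` can then write to disk. The sketch below is a hedged stand-in that mimics that behavior without loading the model, using hypothetical timestamps:

```python
import torch

# Stand-in for collect_chunks: concatenate the sample ranges reported by
# get_speech_timestamps ({'start': ..., 'end': ...} in samples)
def collect_speech(timestamps, audio: torch.Tensor) -> torch.Tensor:
    if not timestamps:
        return torch.empty(0)
    return torch.cat([audio[t['start']:t['end']] for t in timestamps])

audio = torch.arange(100, dtype=torch.float32)          # dummy 100-sample signal
ts = [{'start': 10, 'end': 20}, {'start': 50, 'end': 60}]  # hypothetical output
speech_only = collect_speech(ts, audio)
print(speech_only.shape[0])  # 20 samples of "speech" kept
```

With the real utilities, the equivalent call is `collect_chunks(speech_timestamps, dummy_audio)` followed by `save_audio('speech_only.wav', ..., sampling_rate=SAMPLING_RATE)`.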