Silero Voice Activity Detector (VAD)

6.2.1 · active · verified Sat Apr 11

Silero VAD is a pre-trained Voice Activity Detector from Silero, built with PyTorch. It identifies speech segments within audio and is designed to hold up across many languages and in noisy environments. The version documented here is 6.2.1, and the library maintains an active release cadence with regular updates to models and features.

Warnings

Install
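A minimal install, assuming a pip-based environment. Only `torch` and `torchaudio` are required for the quickstart below; the other lines are optional extras.

```shell
# Core dependencies for the torch.hub quickstart
pip install torch torchaudio

# Optional: ONNX runtime, only needed when loading the model with onnx=True
pip install onnxruntime

# Optional: standalone pip package, as an alternative to loading via torch.hub
pip install silero-vad
```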

Imports
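The quickstart relies on `torch` and `torchaudio` (plus `onnxruntime` only when `onnx=True`). A small, Silero-agnostic sketch to confirm the packages are importable before running it:

```python
import importlib.util

# onnxruntime is optional; it is only needed when loading the model with onnx=True
for pkg in ("torch", "torchaudio", "onnxruntime"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'available' if found else 'not installed'}")
```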

Quickstart

This quickstart demonstrates how to load the Silero VAD model and its associated utilities using `torch.hub.load`. It then generates dummy audio, processes it to detect speech segments using `get_speech_timestamps`, and also illustrates the use of `VADIterator` for processing audio in chunks, useful for real-time applications. Ensure PyTorch and Torchaudio are installed, and optionally `onnxruntime` if ONNX inference is desired.

import torch
import torchaudio
import numpy as np

# Silero VAD is lightweight and runs well on CPU; CUDA is optional.
if not torch.cuda.is_available():
    print("Note: CUDA not available, running VAD on CPU.")

# Load the Silero VAD model and utilities from torch hub.
# force_reload=True re-downloads the repo on every run; set it to False
# to reuse the locally cached copy.
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=True,
    onnx=False # Set to True if onnxruntime is installed and preferred
)

# Destructure utilities
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Define the sampling rate required by the model (e.g., 16000 Hz)
SAMPLING_RATE = 16000

# Create dummy audio for demonstration (10 seconds of random noise at 16 kHz).
# Random noise will rarely trigger speech detections; use a real recording
# to see meaningful timestamps.
samples = SAMPLING_RATE * 10
dummy_audio = torch.randn(samples, dtype=torch.float32)

# In a real scenario, read your audio with torchaudio and resample if needed:
# audio, sr = torchaudio.load('your_audio.wav')
# if sr != SAMPLING_RATE:
#     audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=SAMPLING_RATE)
# The dummy audio above is already generated at SAMPLING_RATE.

# Process the audio to get speech timestamps
speech_timestamps = get_speech_timestamps(dummy_audio, model, sampling_rate=SAMPLING_RATE)

print(f"Speech timestamps detected: {speech_timestamps}")
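`get_speech_timestamps` returns a list of dicts with `'start'` and `'end'` keys, given as sample indices by default (pass `return_seconds=True` for seconds). A small illustrative helper for the manual conversion; `timestamps_to_seconds` is not part of the library:

```python
def timestamps_to_seconds(timestamps, sampling_rate):
    """Convert Silero-style sample-index timestamps to seconds."""
    return [
        {"start": ts["start"] / sampling_rate, "end": ts["end"] / sampling_rate}
        for ts in timestamps
    ]

# e.g. one detected segment from sample 8000 to sample 19200 at 16 kHz
example = [{"start": 8000, "end": 19200}]
print(timestamps_to_seconds(example, 16000))  # [{'start': 0.5, 'end': 1.2}]
```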

# Example of using VADIterator for streaming / real-time processing.
# Recent Silero VAD models expect fixed-size windows: 512 samples at
# 16 kHz (256 samples at 8 kHz); arbitrary chunk sizes will raise an error.
vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)
window_size = 512  # samples per window at 16 kHz
for i in range(0, dummy_audio.shape[0], window_size):
    chunk = dummy_audio[i:i + window_size]
    if chunk.shape[0] < window_size:  # drop the trailing partial window
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(f"Speech event at {i / SAMPLING_RATE:.2f}s: {speech_dict}")

vad_iterator.reset_states()  # Reset internal state before processing a new stream
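The chunking loop above can be factored into a reusable framing helper. This is an illustrative sketch (`frame_audio` is not a library function), shown on a plain Python list so it runs without any audio dependencies:

```python
def frame_audio(samples, window_size=512):
    """Yield successive fixed-size windows, zero-padding the final one."""
    for i in range(0, len(samples), window_size):
        window = samples[i:i + window_size]
        if len(window) < window_size:
            window = window + [0.0] * (window_size - len(window))
        yield window

# 1000 samples with a 512-sample window -> two windows, the last zero-padded
frames = list(frame_audio([0.1] * 1000, window_size=512))
print(len(frames))       # 2
print(len(frames[-1]))   # 512
```

Padding (rather than dropping) the final window keeps the tail of the audio visible to the detector, at the cost of a few silent samples at the very end.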
