{"id":3805,"library":"silero-vad","title":"Silero Voice Activity Detector (VAD)","description":"Silero VAD is a state-of-the-art Voice Activity Detector (VAD) provided by Silero, built with PyTorch. It helps identify speech segments within audio, offering improved quality and performance across various languages and noisy environments. The current version is 6.2.1, and the library maintains an active release cadence with regular updates to models and features.","status":"active","version":"6.2.1","language":"en","source_language":"en","source_url":"https://github.com/snakers4/silero-vad","tags":["audio","voice activity detection","speech processing","machine learning","pytorch","onnx"],"install":[{"cmd":"pip install silero-vad","lang":"bash","label":"Core library"},{"cmd":"pip install onnxruntime","lang":"bash","label":"For CPU ONNX inference (optional)"},{"cmd":"pip install onnxruntime-gpu","lang":"bash","label":"For GPU ONNX inference (optional)"}],"dependencies":[{"reason":"Required for ONNX model inference. Optional since v6.2.1.","package":"onnxruntime","optional":true},{"reason":"Required for GPU ONNX model inference. 
Optional since v6.2.1.","package":"onnxruntime-gpu","optional":true}],"imports":[{"symbol":"torch","correct":"import torch"},{"symbol":"torchaudio","correct":"import torchaudio"},{"note":"`torch.hub.load` fetches the model from the GitHub repository and returns a `(model, utils)` pair. `from silero_vad import model, utils` does not yield the loaded model and its utilities.","wrong":"from silero_vad import model, utils","symbol":"silero_vad model and utils","correct":"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', force_reload=True)"},{"note":"`utils` is a tuple returned by `torch.hub.load`, not a module, so it cannot be imported from. `get_speech_timestamps` is its first element.","wrong":"from utils import get_speech_timestamps","symbol":"get_speech_timestamps","correct":"get_speech_timestamps = utils[0]"},{"note":"`utils` is a tuple returned by `torch.hub.load`, not a module, so it cannot be imported from. `VADIterator` is its fourth element.","wrong":"from utils import VADIterator","symbol":"VADIterator","correct":"VADIterator = utils[3]"}],
"quickstart":{"code":"import torch\nimport torchaudio\n\n# CUDA is optional; the model also runs on CPU\nif not torch.cuda.is_available():\n    print(\"Warning: CUDA not available, using CPU for VAD.\")\n\n# Load the Silero VAD model and utilities from torch hub\n# force_reload=True re-downloads the repo on every call; set it to False to reuse the cached copy\n# onnx=True requires onnxruntime to be installed\nmodel, utils = torch.hub.load(\n    repo_or_dir='snakers4/silero-vad',\n    model='silero_vad',\n    force_reload=True,\n    onnx=False # Set to True if onnxruntime is installed and preferred\n)\n\n# Destructure utilities\n(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils\n\n# Sampling rate of the audio passed to the model (16000 Hz here)\nSAMPLING_RATE = 16000\n\n# Create dummy audio for demonstration (10 seconds, 16kHz)\nsamples = SAMPLING_RATE * 10\ndummy_audio = torch.randn(samples, dtype=torch.float32)\n\n# dummy_audio is already at 16 kHz. For a real file, load and resample it first:\n# audio, sr = torchaudio.load('your_audio.wav')\n# if sr != SAMPLING_RATE: audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=SAMPLING_RATE)\n\n# Process the audio to get speech timestamps\nspeech_timestamps = get_speech_timestamps(dummy_audio, model, sampling_rate=SAMPLING_RATE)\n\nprint(f\"Speech timestamps detected: {speech_timestamps}\")\n\n# Example of using VADIterator for real-time processing (requires audio chunks)\nvad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)\n# v5 and later models accept only fixed-size windows: 512 samples at 16 kHz (256 at 8 kHz)\nchunk_size = 512\nfor i in range(0, dummy_audio.shape[0], chunk_size):\n    chunk = dummy_audio[i:i + chunk_size]\n    if chunk.shape[0] < chunk_size: # Skip the incomplete last chunk\n        break\n    speech_dict = vad_iterator(chunk, return_seconds=True)\n    if speech_dict:\n        print(f\"Speech event in chunk starting at {i/SAMPLING_RATE:.2f}s: {speech_dict}\")\n\nvad_iterator.reset_states() # Reset internal states after processing\n","lang":"python","description":"This quickstart loads the Silero VAD model and its utilities with `torch.hub.load`, generates dummy audio, and detects speech segments with `get_speech_timestamps`. It then shows `VADIterator` processing the same audio in fixed-size chunks (512 samples at 16 kHz), as a real-time application would. PyTorch and Torchaudio must be installed; `onnxruntime` is needed only for ONNX inference."},"warnings":[{"fix":"Run `pip install onnxruntime` (for CPU) or `pip install onnxruntime-gpu` (for GPU) if you intend to use ONNX models. Otherwise, ensure `onnx=False` in your model loading.","message":"As of v6.2.1, `onnxruntime` is no longer a required dependency for `silero-vad`. 
If you plan to use the ONNX version of the models, you must install `onnxruntime` (or `onnxruntime-gpu`) yourself. Without it, passing `onnx=True` to `torch.hub.load` results in an error.","severity":"breaking","affected_versions":">=6.2.1"},{"fix":"Evaluate the new model on your own datasets and adjust VAD parameters (e.g., `threshold`, `min_speech_duration_ms`, `min_silence_duration_ms`) as needed. Check outputs and edge cases against your previous results.","message":"Version 6.0 introduced a 'New v6 VAD' model with improved quality and a changed training algorithm. Applications tuned for older models (v5, v4) may see different detection sensitivity and may need their parameters re-tuned.","severity":"breaking","affected_versions":">=6.0.0"},{"fix":"Update to the v5 model for improved performance and quality, then re-evaluate memory usage and confirm that detection behavior still suits your application. If you are constrained by model size or depend on legacy behavior, consider explicitly loading an older model version if available via `torch.hub.load`.","message":"Version 5.0 introduced significant changes: 3x faster inference, a 2x larger model, and much improved quality, with support for over 6000 languages. Applications that rely on the performance characteristics or size of the v4 model may need to update their pipelines or resource estimates.","severity":"breaking","affected_versions":">=5.0.0"},
{"fix":"Either use the `model, utils = torch.hub.load(...)` pattern and unpack the `utils` tuple, `(get_speech_timestamps, ...) = utils`, or import from the top level of the installed package, e.g. `from silero_vad import load_silero_vad, get_speech_timestamps`.","message":"The Silero VAD model and its core utilities (`get_speech_timestamps`, `VADIterator`, etc.) are returned by `torch.hub.load` from the `snakers4/silero-vad` GitHub repository. The installed `silero_vad` Python package exposes the same utilities at its top level only: there is no `silero_vad.utils` module, so `from silero_vad.utils import get_speech_timestamps` fails with an import error.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always resample your input audio to the rate you pass as `sampling_rate` (e.g., 16000 Hz) before calling the VAD functions. `torchaudio.functional.resample` can be used for this purpose.","message":"The VAD models expect audio at 8000 Hz or 16000 Hz, with 16000 Hz the most common choice. If the actual sampling rate of the audio does not match the `sampling_rate` argument, detection is incorrect or degraded and no error is raised.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}