{"id":7643,"library":"pyvad","title":"pyvad","description":"Pyvad is a Python wrapper around the `py-webrtcvad` library (the WebRTC Voice Activity Detector, installed from PyPI as `webrtcvad`), designed for trimming silence from speech clips. It provides a simplified interface to Voice Activity Detection (VAD), letting users identify and extract voiced segments from audio data. The current version is 0.2.0, released in July 2022; releases are infrequent.","status":"active","version":"0.2.0","language":"en","source_language":"en","source_url":"https://github.com/F-Tag/python-vad","tags":["audio processing","voice activity detection","VAD","speech","webrtc"],"install":[{"cmd":"pip install pyvad","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for numerical operations on audio data.","package":"numpy"},{"reason":"Used within pyvad for audio processing, including resampling.","package":"librosa"},{"reason":"The core Voice Activity Detector that pyvad wraps (the py-webrtcvad project, published on PyPI as `webrtcvad`).","package":"webrtcvad"}],"imports":[{"note":"The primary function for performing Voice Activity Detection.","symbol":"vad","correct":"from pyvad import vad"},{"note":"Function that locates the start and end of speech from VAD results, for trimming leading and trailing silence.","symbol":"trim","correct":"from pyvad import trim"}],"quickstart":{"code":"import numpy as np\nfrom pyvad import vad, trim\n\n# Simulate audio data: 1 second of silence, 1 second of 'speech', 1 second of silence\nfs = 16000 # Sample rate in Hz (a rate supported by WebRTC VAD)\nduration_speech = 1.0 # seconds\nduration_silence = 1.0 # seconds\n\n# Generate a 440 Hz sine wave as a stand-in for speech\n# (a pure tone is not real speech; results on real recordings are more meaningful)\nt = np.linspace(0, duration_speech, int(fs * duration_speech), endpoint=False)\nspeech_data = 0.5 * np.sin(2 * np.pi * 440 * t)\n\n# Generate silence\nsilence_data = np.zeros(int(fs * duration_silence))\n\n# Concatenate silence, speech and silence; float data must stay within [-1.0, 1.0]\naudio_data = np.concatenate((silence_data, speech_data, silence_data)).astype(np.float32)\n\nprint(f\"Audio data shape: {audio_data.shape}, Sample rate: {fs} Hz\")\n\n# Perform Voice Activity Detection\nvact = vad(audio_data, fs)\nprint(f\"Voice activity array shape: {vact.shape}\")\n# vact contains 1 for voiced audio and 0 for unvoiced audio\n\n# trim returns the speech boundaries as sample indices; slice off leading/trailing silence\nstart_idx, end_idx = trim(audio_data, fs)\ntrimmed_audio = audio_data[start_idx:end_idx]\nprint(f\"Trimmed audio shape: {trimmed_audio.shape}\")\nprint(f\"Original audio length: {len(audio_data) / fs:.2f}s\")\nprint(f\"Trimmed audio from {start_idx/fs:.2f}s to {end_idx/fs:.2f}s, total {len(trimmed_audio)/fs:.2f}s\")\n","lang":"python","description":"This quickstart demonstrates how to use `pyvad` to perform Voice Activity Detection and trim silence from a simulated audio clip. It builds a sample array of silence, a tone standing in for speech, and more silence, then applies the `vad` and `trim` functions. `vad` returns an array marking voiced (1) and unvoiced (0) audio, and `trim` returns the start and end sample indices of the detected speech, which are used to slice off the leading and trailing silence."},"warnings":[{"fix":"Use Python 3.8 or 3.9. Rename the `hoplength` argument to `hop_length`, and update calls to `trim` so they expect the `(start_index, end_index)` tuple it now returns instead of relying on `return_sec`. See `example.ipynb` in the GitHub repository for current API usage.","message":"Pyvad v0.2.0 introduced significant breaking changes. It restricts Python compatibility to versions 3.8 and 3.9; earlier releases may support Python 3.6+, and pre-0.1.0 releases possibly even Python 2.x. The `hoplength` argument was renamed to `hop_length`, and the `return_sec` argument of `trim` was removed; `trim` now returns `(start_index, end_index)` directly.","severity":"breaking","affected_versions":">=0.2.0"},{"fix":"Make sure the input audio (`data`, `fs`) and the VAD parameters (`fs_vad`, `hop_length`, `vad_mode`) meet these constraints. `fs_vad` must be one of the supported rates. Pyvad itself resamples `data` from `fs` to `fs_vad` (librosa is a dependency for this), so manual resampling with `librosa.resample` should normally be unnecessary. Downmix multichannel audio to mono (for example `data.mean(axis=1)` for an `(n_samples, n_channels)` array). Scale `float` data to the -1.0 to 1.0 range, and keep `int` data within the 16-bit PCM range.","message":"The underlying `webrtcvad` library, and therefore `pyvad`, has strict requirements for audio input parameters. `fs_vad` (the internal sampling frequency used for VAD) must be 8000, 16000, 32000, or 48000 Hz. `hop_length` (the frame duration) must be 10, 20, or 30 milliseconds. Input `data` must be mono and correctly scaled: `int` data must lie between -32768 and 32767, and `float` data between -1.0 and 1.0. Violating these requirements raises a `ValueError`.","severity":"gotcha","affected_versions":"All"},{"fix":"When calling `vad` or `trim`, set `vad_mode` to 0, 1, 2, or 3 depending on the desired aggressiveness. The default is 0.","message":"The `vad_mode` parameter, which controls aggressiveness, must be an integer from 0 to 3. Higher values (e.g., 3) filter out non-speech more aggressively, while lower values (e.g., 0) are less aggressive. A value outside this range raises a `ValueError`.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Set `fs_vad` to 8000, 16000, 32000, or 48000, for example `vad(data, fs, fs_vad=16000)`. This constrains only the internal VAD rate. Pyvad resamples `data` from `fs` to `fs_vad` itself, so the input sample rate `fs` should not need to match.","cause":"The `fs_vad` parameter passed to `vad` or `trim` is not one of the sampling frequencies supported by the WebRTC VAD.","error":"ValueError: fs_vad must be 8000, 16000, 32000 or 48000."},{"fix":"Set the `hop_length` argument to 10, 20, or 30, for example `vad(data, fs, hop_length=30)`.","cause":"The `hop_length` parameter (frame duration in milliseconds) is not one of the allowed values.","error":"ValueError: hop_length must be 10, 20, or 30."},{"fix":"Scale `float` audio into range, for example by dividing by its maximum absolute value: `audio_data = audio_data / np.max(np.abs(audio_data))`. Skip this for all-zero input, which would divide by zero. For `int` data, make sure the values fit the 16-bit range.","cause":"The input audio `data` is not scaled correctly for its data type. `float` data must be between -1.0 and 1.0, and `int` data must fit within the 16-bit PCM range (-32768 to 32767).","error":"ValueError: When data.type is float, data must be -1.0 <= data <= 1.0. (or similar for int data type)"},{"fix":"First, make sure you are running Python 3.8 or 3.9, as required by `pyvad` 0.2.0. Then reinstall `pyvad` and its dependencies with `pip install --upgrade --no-cache-dir pyvad webrtcvad numpy librosa`. Check the installation log for errors, particularly compilation errors from `webrtcvad`, which is a C extension.","cause":"This error can occur if pyvad or one of its dependencies (such as `webrtcvad`) failed to install correctly. It can also occur if the Python version is not supported by pyvad 0.2.0, especially older versions.","error":"ImportError: cannot import name 'vad' from 'pyvad' (.../pyvad/__init__.py)"}]}