Vosk: Offline Speech Recognition API
Vosk is an offline, open-source speech recognition toolkit built on Kaldi. Its Python bindings provide speech-to-text for more than 20 languages and dialects, supporting continuous large-vocabulary transcription. It is designed to run efficiently on small devices such as the Raspberry Pi, and because audio is processed locally, no data leaves the machine. As of this writing the current release is 0.3.45, with active development and frequent releases.
Common errors
- ModuleNotFoundError: No module named 'vosk'
  Cause: The Vosk library is not installed in the currently active Python environment, or multiple Python installations are conflicting.
  Fix: Ensure `vosk` is installed in your active Python environment: `pip install vosk`. If you have multiple Python versions, verify that `pip show vosk` points to the correct installation. Using virtual environments (`venv` or `conda`) is highly recommended to isolate dependencies.
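When several Python installations are in play, running pip through the interpreter you actually use removes the ambiguity. A minimal stdlib sketch of that check:

```python
import subprocess
import sys

# Show which interpreter is running -- this is the environment pip must target
print(sys.executable)

# Invoke pip via that same interpreter, so the query hits the environment
# this script actually runs in (not whatever 'pip' happens to be on PATH)
proc = subprocess.run(
    [sys.executable, "-m", "pip", "show", "vosk"],
    capture_output=True, text=True,
)
# returncode 0 means vosk is visible to this interpreter; non-zero means it is not
print("vosk installed" if proc.returncode == 0 else "vosk NOT installed here")
```

Installing the same way (`python -m pip install vosk`) guarantees the package lands where the script runs.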
- ERROR (VoskAPI:Model():model.cc:122) Folder 'model' does not contain model files. Make sure you specified the model path properly in Model constructor. Exception: Failed to create a model.
  Cause: The `vosk.Model()` constructor could not find the model files in the specified directory, typically because of an incorrect path, a wrong unzipped folder structure, or an incomplete model download.
  Fix: Verify that the path passed to `vosk.Model()` points directly to the *unzipped* model directory (e.g., `vosk-model-small-en-us-0.22`), not the zip file itself. Ensure all model subdirectories and files are present within this path. Prefer an absolute path for robustness.
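Resolving the model directory to an absolute path and checking it exists before calling `vosk.Model()` turns the opaque C++ exception into a clear Python error. A stdlib-only sketch (the directory names are examples):

```python
import os

def resolve_model_path(base_dir, model_name="vosk-model-small-en-us-0.22"):
    """Return an absolute model path, raising early with a clear message if missing."""
    path = os.path.join(os.path.abspath(base_dir), model_name)
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Model directory not found: {path}. "
            "Point this at the unzipped model folder, not the .zip file."
        )
    return path

# Usage (after the check passes, it is safe to construct the model):
#   model_path = resolve_model_path(os.path.dirname(os.path.abspath(__file__)))
#   model = vosk.Model(model_path)
```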
- Exception: Failed to process waveform
  Cause: The audio fed to `recognizer.AcceptWaveform()` does not match the format or sample rate the loaded model expects, most often a sample-rate mismatch between the audio source and the `KaldiRecognizer` initialization.
  Fix: Check the sample rate of your audio file/stream and ensure it matches the `sample_rate` argument passed to `vosk.KaldiRecognizer()`. Additionally, confirm the audio is mono, 16-bit PCM, uncompressed WAV. Convert the audio if necessary using `ffmpeg`.
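A small helper that checks the properties Vosk cares about (channels, sample width, compression, rate) can catch a bad file before it reaches the recognizer. A stdlib-only sketch; the in-memory WAV at the bottom only exists to demonstrate the check:

```python
import io
import wave

def check_wav(fileobj, expected_rate=16000):
    """Return (ok, details) for mono, 16-bit PCM, uncompressed WAV at expected_rate."""
    with wave.open(fileobj, "rb") as wf:
        details = {
            "channels": wf.getnchannels(),
            "sample_width": wf.getsampwidth(),
            "compression": wf.getcomptype(),
            "rate": wf.getframerate(),
        }
    ok = (
        details["channels"] == 1
        and details["sample_width"] == 2      # 2 bytes = 16-bit
        and details["compression"] == "NONE"  # uncompressed PCM
        and details["rate"] == expected_rate
    )
    return ok, details

# Demonstrate with a synthetic 16 kHz, mono, 16-bit WAV held in memory
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence
buf.seek(0)
ok, details = check_wav(buf)
print(ok, details)
```

For real files, pass the filename instead of a `BytesIO` object: `check_wav("test.wav")`.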
- { "text" : "" } (Vosk returns an empty transcription)
  Cause: Vosk is not detecting any speech or is unable to process the audio effectively. Common causes include incorrect audio format, extremely low volume, a model not suited to the language or accent, or an incorrect sample rate.
  Fix: Double-check the audio format (mono, 16-bit PCM, 16 kHz WAV is standard) and ensure the `KaldiRecognizer` is initialized with the correct sample rate. Verify the audio actually contains speech and is not too quiet. Try a different Vosk model for the target language. Ensure you are calling `rec.FinalResult()` at the end of processing to retrieve any pending transcription.
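`Result()`, `PartialResult()`, and `FinalResult()` all return JSON strings, so an empty transcription shows up literally as `{"text": ""}`. Parsing the strings with the standard `json` module makes the empty case explicit; the sample strings below mimic the output shape:

```python
import json

# Strings of the shape returned by KaldiRecognizer.Result() / FinalResult()
chunks = ['{"text" : ""}', '{"text" : "hello world"}']

transcript = []
for raw in chunks:
    text = json.loads(raw).get("text", "")
    if text:                      # skip silent / undetected segments
        transcript.append(text)

print(" ".join(transcript))       # -> hello world
```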
Warnings
- breaking Vosk version 0.3.30 introduced an API change regarding word times, making them optional. Code relying on previous implicit behavior might need adjustment.
- gotcha Incorrect audio format (e.g., stereo, wrong sample rate, compressed) is a common cause of poor recognition or the `Failed to process waveform` error. Vosk models typically expect mono, 16-bit PCM, uncompressed WAV audio, usually at 16kHz sample rate.
- gotcha Model files must be downloaded separately and their path correctly specified. Using relative paths can lead to errors if the script's execution directory changes, or if `model_name` argument is used incorrectly.
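Following the word-times warning above: word timestamps must be requested explicitly with `rec.SetWords(True)` before feeding audio, after which the result JSON carries a `result` list with per-word `start`/`end`/`conf` fields. A stdlib sketch of parsing that shape (the sample JSON below is illustrative, not real recognizer output):

```python
import json

# Shape of rec.Result() after rec.SetWords(True); values here are illustrative
raw = json.dumps({
    "result": [
        {"word": "hello", "start": 0.30, "end": 0.62, "conf": 0.98},
        {"word": "world", "start": 0.70, "end": 1.10, "conf": 0.95},
    ],
    "text": "hello world",
})

parsed = json.loads(raw)
# 'result' is optional -- absent when SetWords(True) was not called,
# so always fall back to an empty list
for w in parsed.get("result", []):
    print(f"{w['word']}: {w['start']:.2f}-{w['end']:.2f}s (conf {w['conf']:.2f})")
```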
Install
- pip install vosk
- pip install vosk pyaudio
Imports
- Model
from vosk import Model
- KaldiRecognizer
from vosk import KaldiRecognizer
- Vosk imports as vosk.*
import vosk
Quickstart
import os
import wave
from vosk import Model, KaldiRecognizer

# --- IMPORTANT: Download a Vosk model ---
# 1. Visit https://alphacephei.com/vosk/models
# 2. Download a small model (e.g., vosk-model-small-en-us-0.22.zip)
# 3. Unzip it into a directory. For this example, let's assume it's in a 'model' folder
#    adjacent to your script, e.g., 'your_project/model/vosk-model-small-en-us-0.22'
MODEL_PATH = "model/vosk-model-small-en-us-0.22"  # Adjust this path to your downloaded model
AUDIO_FILE = "test.wav"  # Ensure you have a WAV file (16kHz, 16-bit PCM, mono)

if not os.path.exists(MODEL_PATH):
    print(f"Error: Vosk model not found at {MODEL_PATH}")
    print("Please download a model from https://alphacephei.com/vosk/models and unzip it into the specified path.")
    exit(1)

# Open and validate the audio file first, so the recognizer can use its real sample rate
try:
    wf = wave.open(AUDIO_FILE, "rb")
except wave.Error as e:
    print(f"Error opening audio file {AUDIO_FILE}: {e}")
    print("Please ensure the audio file exists and is a valid WAV.")
    exit(1)

if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be MONO, 16-bit PCM, uncompressed WAV.")
    print("Consider using ffmpeg to convert: ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav")
    exit(1)

# Load the Vosk model and initialize the KaldiRecognizer.
# The sample rate passed to the recognizer MUST match the audio file's actual rate,
# so read it from the file instead of hardcoding 16000.
model = Model(MODEL_PATH)
rec = KaldiRecognizer(model, wf.getframerate())

# Process audio data in chunks
print("Transcribing...")
while True:
    data = wf.readframes(4000)  # 4000 frames (approx. 0.25 seconds for 16kHz audio)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())

# Get the final result for any remaining audio
print(rec.FinalResult())
print("Transcription complete.")