Vosk: Offline Speech Recognition API

0.3.45 · active · verified Thu Apr 16

Vosk is an offline, open-source speech recognition toolkit built on Kaldi. Its Python bindings provide speech-to-text for over 20 languages and dialects, with support for continuous large-vocabulary transcription. It runs efficiently on modest hardware, including the Raspberry Pi, and preserves privacy because audio is processed entirely locally. The current version is 0.3.45, and the project is under active development with frequent releases.

Install
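Vosk is published on PyPI under the package name `vosk`, so a standard pip install is all that is needed:

```shell
pip install vosk
```

Note that the language models themselves are downloaded separately (see the quickstart below).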

Imports
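The two classes used throughout the quickstart below:

```python
from vosk import Model, KaldiRecognizer
```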

Quickstart

This quickstart demonstrates how to set up Vosk for transcribing a WAV audio file. It involves downloading a pre-trained language model, loading it into a `Model` object, initializing a `KaldiRecognizer` with the model and the audio's sample rate, and then feeding audio data in chunks for recognition. Ensure your audio file is 16kHz, 16-bit PCM, mono WAV format.

import os
import wave
from vosk import Model, KaldiRecognizer

# --- IMPORTANT: Download a Vosk model ---
# 1. Visit https://alphacephei.com/vosk/models
# 2. Download a small model (e.g., vosk-model-small-en-us-0.22.zip)
# 3. Unzip it into a directory. For this example, let's assume it's in a 'model' folder
#    adjacent to your script, e.g., 'your_project/model/vosk-model-small-en-us-0.22'

MODEL_PATH = "model/vosk-model-small-en-us-0.22"  # Adjust this path to your downloaded model
AUDIO_FILE = "test.wav" # Ensure you have a WAV file (16kHz, 16-bit PCM, mono)

if not os.path.exists(MODEL_PATH):
    print(f"Error: Vosk model not found at {MODEL_PATH}")
    print("Please download a model from https://alphacephei.com/vosk/models and unzip it into the specified path.")
    exit(1)

# Load the Vosk model
model = Model(MODEL_PATH)

# Open the audio file
try:
    wf = wave.open(AUDIO_FILE, "rb")
except wave.Error as e:
    print(f"Error opening audio file {AUDIO_FILE}: {e}")
    print("Please ensure the audio file exists and is a valid WAV.")
    exit(1)

if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be mono, 16-bit PCM, uncompressed WAV.")
    print("Convert with ffmpeg: ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav")
    exit(1)

# Initialize the KaldiRecognizer with the model and the audio file's actual
# sample rate (Vosk models are typically trained on 16 kHz audio)
rec = KaldiRecognizer(model, wf.getframerate())

# Process audio data in chunks
print("Transcribing...")
while True:
    data = wf.readframes(4000) # Read 4000 frames (approx. 0.25 seconds for 16kHz audio)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = rec.Result()
        print(result)

# Flush the recognizer and get the final result for any remaining audio
final_result = rec.FinalResult()
print(final_result)

wf.close()
print("Transcription complete.")
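The `Result()` and `FinalResult()` calls above return JSON strings rather than plain text; the transcript is under the `"text"` key (interim output from `PartialResult()` uses `"partial"` instead). A minimal sketch of extracting the transcript, using a literal sample string so it runs without a model:

```python
import json

# Example of the JSON string that rec.FinalResult() returns
sample_result = '{"text": "hello world"}'

parsed = json.loads(sample_result)
transcript = parsed.get("text", "")
print(transcript)  # hello world
```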
