OpenAI Whisper
OpenAI Whisper is a general-purpose automatic speech recognition (ASR) model developed by OpenAI. It is trained on a large, diverse audio dataset and supports multilingual speech recognition, speech translation, and language identification. Releases are irregular: updates typically appear a few times a year, often tagged with dated version strings (e.g., YYYYMMDD).
Warnings
- gotcha FFmpeg is a critical system-level dependency for `openai-whisper` to process audio files. The Python package installation does not include FFmpeg itself. You must install it separately using your operating system's package manager (e.g., `sudo apt install ffmpeg` on Debian/Ubuntu, `brew install ffmpeg` on macOS).
- breaking The `openai-whisper` library (this package) is distinct from the Whisper API offered by OpenAI (which uses the `openai` Python client library). The APIs and usage patterns are different. This registry entry pertains to the open-source `openai-whisper` library for local model execution.
- gotcha Whisper models, especially the larger ones ('medium', 'large'), require significant CPU RAM and/or GPU VRAM. The `large` model can need 10GB or more of VRAM for inference, plus substantial system RAM, making it challenging to run on systems without a powerful GPU. Running multiple instances in parallel requires proportionally more resources.
- gotcha Installation issues can occur, particularly if the `tiktoken` dependency fails to build. `tiktoken` requires a Rust compiler and associated build tools on your system. For Windows, this often means installing Microsoft Visual C++ Build Tools.
- gotcha Whisper models can sometimes 'hallucinate' or produce irrelevant transcriptions, especially with silent audio segments, noisy input, or ambiguous speech. They may also struggle with specific jargon or heavy accents.
- gotcha The `turbo` model, while fast, is optimized for transcription and is not trained for translation. Using `--task translate` with the `turbo` model will not produce a translation; the output remains in the original language.
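Since FFmpeg is an external dependency (see the first warning above), a preflight check can fail fast with a clearer message than a cryptic decode error later. A minimal sketch using only the standard library's `shutil.which`, which just looks up the executable on PATH:

```python
import shutil

def ffmpeg_available() -> bool:
    # shutil.which returns the executable's full path if found on PATH, else None
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    print("FFmpeg found")
else:
    print("FFmpeg missing: install it, e.g. 'sudo apt install ffmpeg' or 'brew install ffmpeg'")
```

Note this only confirms the binary is reachable; it does not verify the FFmpeg build supports the codec of your particular audio file.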
Install
-
pip install -U openai-whisper
Imports
- whisper
import whisper
Quickstart
import whisper
import os

# Ensure a valid audio file exists. For this quickstart we generate a
# 1-second sine-wave WAV if none is present (requires scipy and numpy);
# for real use, replace it with your own recording.
audio_path = "dummy_audio.wav"
if not os.path.exists(audio_path):
    try:
        import numpy as np
        from scipy.io.wavfile import write

        samplerate = 16000  # 16 kHz, Whisper's native sample rate
        duration = 1.0      # seconds
        frequency = 440     # A4 note
        t = np.linspace(0.0, duration, int(samplerate * duration), endpoint=False)
        amplitude = np.iinfo(np.int16).max * 0.5
        data = amplitude * np.sin(2.0 * np.pi * frequency * t)
        write(audio_path, samplerate, data.astype(np.int16))
        print(f"Created a dummy audio file: {audio_path}")
    except ImportError:
        raise SystemExit("scipy/numpy not found. Cannot create dummy audio; please provide your own audio file.")

# Load a Whisper model ('tiny', 'base', 'small', 'medium', 'large').
# 'tiny' or 'base' are good for quick tests, 'large' for best accuracy.
# The model weights are downloaded on first use.
print("Loading Whisper model...")
model = whisper.load_model("base")

# Transcribe the audio file
print(f"Transcribing {audio_path}...")
result = model.transcribe(audio_path)

# Print the transcription
print("Transcription:")
print(result["text"])
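Beyond the plain text, `transcribe()` also returns per-segment timestamps under `result["segments"]`. A small sketch that formats them as timestamped lines; it is demonstrated here on a hand-made dict with the same shape, so it runs without downloading a model:

```python
def format_segments(result: dict) -> list[str]:
    # Each segment dict carries 'start'/'end' times in seconds and the segment 'text'
    return [
        f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]

# With a real model you would pass the dict returned by model.transcribe(...).
# Here we use a hand-made result of the same shape for illustration:
fake_result = {"segments": [{"start": 0.0, "end": 2.5, "text": " Hello world."}]}
for line in format_segments(fake_result):
    print(line)
```

This is handy for producing rough subtitles or locating where in the audio a phrase occurred.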