WhisperX
WhisperX is a Python library that provides time-accurate Automatic Speech Recognition (ASR) built on OpenAI's Whisper model, adding word-level timestamp alignment and speaker diarization. It supports a range of model sizes, languages, and device configurations (CPU/GPU) to deliver high-quality transcription with precise timestamps and speaker labels. The current version is 3.8.5, and the project is actively maintained with frequent releases.
Warnings
- gotcha FFmpeg is a critical system dependency for WhisperX's audio processing. It must be installed separately (e.g., via apt, brew, or by downloading binaries) and available in your system's PATH. WhisperX will not function without it.
- gotcha GPU (CUDA) setup with PyTorch can be complex. Ensure the installed `torch` build matches your CUDA toolkit version. An incorrect installation often leads to 'CUDA not available' errors or a silent fallback to CPU.
- gotcha Speaker diarization models (e.g., from Hugging Face) often require an authentication token to download. If not provided, `DiarizationPipeline` might fail or throw an error about missing credentials.
- gotcha WhisperX models, especially 'large-v2', can consume significant GPU VRAM and system RAM. Running large models on devices with insufficient memory will result in Out-of-Memory (OOM) errors.
- breaking The `whisperx.load_model` signature has changed across major versions. Older versions might not accept `compute_type` or expect different arguments for device configuration. The internal implementation of some functions might also be refined, affecting older custom workflows.
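Because the `load_model` signature has changed across major versions, a small compatibility wrapper can smooth over the difference. This is a hedged sketch: `load_model_compat` is a hypothetical helper, not part of WhisperX; it takes the loader as an argument (you would pass `whisperx.load_model`).

```python
def load_model_compat(load_fn, name, device, compute_type="float16"):
    """Try the newer keyword-argument signature first; fall back to the
    older positional-only form if compute_type is rejected.

    load_fn is expected to be whisperx.load_model (or any compatible callable).
    """
    try:
        return load_fn(name, device, compute_type=compute_type)
    except TypeError:
        # Older releases do not accept compute_type; retry without it.
        return load_fn(name, device)
```

Usage would then be `model = load_model_compat(whisperx.load_model, "base", "cuda")`, which works on either side of the signature change.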
Install
- GPU (CUDA 12.1 wheels):
pip install whisperx torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
- CPU only:
pip install whisperx torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0
- For the quickstart example below:
pip install requests
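After installing, a quick preflight check can confirm that FFmpeg is on PATH and which device/precision WhisperX would end up using, before any model download. A minimal sketch (the function name `preflight` is illustrative, not a WhisperX API):

```python
import shutil

def preflight():
    """Report FFmpeg availability and the device/precision the quickstart would pick."""
    # WhisperX shells out to ffmpeg for audio decoding; it must be on PATH.
    ffmpeg_ok = shutil.which("ffmpeg") is not None
    try:
        import torch
        device = "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # torch missing entirely: the install step above did not complete.
        device = "cpu"
    # float16 on GPU, int8 on CPU (same choice as the quickstart below)
    compute_type = "float16" if device == "cuda" else "int8"
    return ffmpeg_ok, device, compute_type

if __name__ == "__main__":
    ffmpeg_ok, device, compute_type = preflight()
    print(f"ffmpeg on PATH: {ffmpeg_ok}, device: {device}, compute_type: {compute_type}")
```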
Imports
- load_model
import whisperx
model = whisperx.load_model(...)
- load_audio
import whisperx
audio = whisperx.load_audio(...)
- DiarizationPipeline
from whisperx import DiarizationPipeline
diarize_model = DiarizationPipeline(...)
- load_align_model
import whisperx
model_a, metadata = whisperx.load_align_model(...)
- assign_word_speakers
import whisperx
result = whisperx.assign_word_speakers(...)
Quickstart
import whisperx
import torch
import os
from pathlib import Path
import requests
# --- Setup for a runnable example ---
# Path for the temporary audio file
temp_audio_path = Path("temp_whisperx_example.wav")
# A small WAV file from Mozilla DeepSpeech samples
wav_url = "https://github.com/mozilla/DeepSpeech/raw/master/samples/audio/8455-210777-0068.wav"
# Download a small audio file if it doesn't exist
if not temp_audio_path.exists():
    print(f"Downloading sample audio from {wav_url}...")
    try:
        response = requests.get(wav_url, stream=True)
        response.raise_for_status()
        with open(temp_audio_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Sample audio downloaded.")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download sample audio: {e}")
        print("Please ensure you have an internet connection or manually place a WAV file at temp_whisperx_example.wav")
        raise SystemExit(1)
# --- WhisperX Core Logic ---
device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 for GPU, int8 for CPU or low VRAM GPU
compute_type = "float16" if device == "cuda" else "int8"
batch_size = 16 # Reduce if low on GPU VRAM
print(f"\nLoading WhisperX model ('base') on {device} with {compute_type} precision...")
# Using 'base' for faster download and less VRAM for quickstart
model = whisperx.load_model("base", device, compute_type=compute_type, language="en")
print(f"Loading audio from {temp_audio_path}...")
audio = whisperx.load_audio(str(temp_audio_path))
print("Transcribing audio...")
result = model.transcribe(audio, batch_size=batch_size)
print("Loading alignment model and aligning segments...")
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], model_a, metadata, audio, device)
print("\nTranscription Result (Aligned):")
for segment in aligned_result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}")
# Optional: Add diarization for speaker assignment
# Diarization models from Hugging Face may require an auth token.
# Set your Hugging Face token as an environment variable (e.g., HF_TOKEN="hf_xxxx")
hf_token = os.environ.get("HF_TOKEN", "")
if hf_token:
    print("\nPerforming diarization (speaker assignment)...")
    # Diarization requires an internet connection to download models
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(str(temp_audio_path), min_speakers=1, max_speakers=2)
    result_with_speakers = whisperx.assign_word_speakers(diarize_segments, aligned_result)
    print("\nTranscription with Speakers:")
    for segment in result_with_speakers["segments"]:
        print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment.get('speaker', 'UNKNOWN')}: {segment['text']}")
else:
    print("\nSkipping diarization: HF_TOKEN environment variable not set. Hugging Face diarization models may require it.")
# --- Cleanup ---
temp_audio_path.unlink(missing_ok=True)
print("\nWhisperX quickstart completed.")
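The aligned segments can be serialized to standard subtitle formats with plain Python. WhisperX also ships its own output writers, but a hand-rolled formatter makes the segment structure explicit; `segments_to_srt` below is a minimal sketch, not a WhisperX API, and assumes segments shaped like the quickstart's (`start`, `end`, `text`, optional `speaker`).

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {'start', 'end', 'text', optional 'speaker'} dicts as SRT."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        text = seg["text"].strip()
        if "speaker" in seg:
            text = f"[{seg['speaker']}] {text}"
        lines.append(text)
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```

Writing the quickstart's result to disk would then be `Path("out.srt").write_text(segments_to_srt(aligned_result["segments"]))`.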