Faster Whisper
Faster Whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, a fast inference engine for Transformer models. It delivers faster inference and lower memory usage than the reference implementation, is optimized for both CPU and GPU, and supports several compute types (e.g., int8, float16). The current version is 1.2.1, with an active release cadence that regularly adds model support, features, and performance improvements.
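The device and compute-type choice can be sketched as a small helper. This is a minimal sketch assuming faster-whisper is installed; the helper name `load_model`, its defaults, and the `use_gpu` flag are illustrative, not part of the library:

```python
def load_model(model_size="tiny.en", use_gpu=False):
    # Import inside the helper so the module can be defined without the
    # dependency present (illustrative pattern, not required by the library).
    from faster_whisper import WhisperModel

    if use_gpu:
        # float16 roughly halves memory on most NVIDIA GPUs.
        return WhisperModel(model_size, device="cuda", compute_type="float16")
    # int8 quantization is the usual choice for CPU inference.
    return WhisperModel(model_size, device="cpu", compute_type="int8")
```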
Warnings
- breaking Version 1.0.0 upgraded CTranslate2 to v4.0, which added support for CUDA 12. Users on older CUDA versions (e.g., CUDA 11.x) might face compatibility issues and need to downgrade CTranslate2 or use a compatible `faster-whisper` version.
- breaking In version 1.1.0, some Voice Activity Detection (VAD) parameters were renamed. However, this change was reverted in version 1.1.1. If you implemented VAD parameter tuning with v1.1.0, your code might break when upgrading to v1.1.1 or later due to the reversion to original names.
- gotcha Older versions (prior to 1.1.1) and certain VAD configurations could lead to high RAM usage and Out-Of-Memory (OOM) errors, particularly with longer audio files or larger batch sizes.
- gotcha When using batched inference, specific issues regarding `clip_timestamps` and the `<|nocaptions|>` token were fixed in version 1.2.1. In earlier versions, these features might not have behaved as expected in batched mode, potentially leading to incorrect timestamp merging or token suppression.
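Because the VAD parameter names changed in 1.1.0 and were reverted in 1.1.1, it helps to pin them in one place. A minimal sketch assuming faster-whisper >= 1.1.1 with the post-reversion names (`vad_filter` and `vad_parameters` exist in current releases, but verify the accepted keys against your installed version; the helper name is illustrative):

```python
def transcribe_with_vad(audio_path, model_size="tiny.en"):
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(
        audio_path,
        vad_filter=True,  # drop non-speech chunks via the bundled Silero VAD
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    # segments is a lazy generator; materialize it before the model goes away.
    return list(segments), info
```

Keeping the parameter dict in a single helper like this localizes any breakage when upgrading across the 1.1.x renames.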
Install
- pip install faster-whisper
- pip install faster-whisper[vad,audio]
Imports
- WhisperModel
from faster_whisper import WhisperModel
Quickstart
from faster_whisper import WhisperModel
import os
# Ensure you have an audio file named 'audio.mp3' in the current directory
# For example, download a short audio clip or record one.
# Example: https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
model_size = os.environ.get('WHISPER_MODEL_SIZE', 'tiny.en') # e.g., 'large-v3', 'medium', 'tiny.en'
# Run on CPU with INT8 compute type for general compatibility
# For GPU, change device='cuda' and compute_type='float16' if supported
model = WhisperModel(model_size, device='cpu', compute_type='int8')
# Transcribe the audio file
# Replace 'audio.mp3' with the path to your audio file
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language '{info.language}' with probability {info.language_probability:.2f}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
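For the batched mode mentioned in the warnings, faster-whisper provides a `BatchedInferencePipeline` wrapper around a loaded model. A minimal sketch assuming version >= 1.1.0 (the helper name and the `batch_size` default are illustrative; larger batch sizes trade memory for throughput and risked OOM on pre-1.1.1 releases):

```python
def transcribe_batched(audio_path, model_size="tiny.en", batch_size=8):
    from faster_whisper import WhisperModel, BatchedInferencePipeline

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    batched = BatchedInferencePipeline(model=model)
    # Batched transcribe splits the audio and decodes chunks in parallel.
    segments, info = batched.transcribe(audio_path, batch_size=batch_size)
    return list(segments), info
```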