{"id":4851,"library":"whisperx","title":"WhisperX","description":"WhisperX is a Python library that provides time-accurate Automatic Speech Recognition (ASR) using OpenAI's Whisper model, enhanced with speaker diarization. It supports a range of models, languages, and device configurations (CPU/GPU) to offer high-quality transcription with precise timestamps and speaker identification. The current version is 3.8.5, and it maintains an active release cadence with frequent updates.","status":"active","version":"3.8.5","language":"en","source_language":"en","source_url":"https://github.com/m-bain/whisperX","tags":["speech-to-text","audio-processing","ai","whisper","diarization","asr"],"install":[{"cmd":"pip install whisperx torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121","lang":"bash","label":"With CUDA (recommended for GPU)"},{"cmd":"pip install whisperx torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0","lang":"bash","label":"Without CUDA (for CPU only)"},{"cmd":"pip install requests","lang":"bash","label":"For Quickstart Example (downloading audio)"}],"dependencies":[{"reason":"Core deep learning framework for model execution (required for CPU or GPU). 
For GPU use, the installed torch build must match your CUDA version.","package":"torch","optional":false},{"reason":"Used in the quickstart example to download a sample audio file, not a core whisperx dependency.","package":"requests","optional":true}],"imports":[{"symbol":"load_model","correct":"import whisperx\nmodel = whisperx.load_model(...)"},{"symbol":"load_audio","correct":"import whisperx\naudio = whisperx.load_audio(...)"},{"symbol":"DiarizationPipeline","correct":"from whisperx.diarize import DiarizationPipeline\ndiarize_model = DiarizationPipeline(...)"},{"symbol":"load_align_model","correct":"import whisperx\nmodel_a, metadata = whisperx.load_align_model(...)"},{"symbol":"assign_word_speakers","correct":"import whisperx\nresult = whisperx.assign_word_speakers(...)"}],"quickstart":{"code":"import os\nfrom pathlib import Path\n\nimport requests\nimport torch\nimport whisperx\n\n# --- Setup for a runnable example ---\n# Path for the temporary audio file\ntemp_audio_path = Path(\"temp_whisperx_example.wav\")\n# A small WAV file from Mozilla DeepSpeech samples\nwav_url = \"https://github.com/mozilla/DeepSpeech/raw/master/samples/audio/8455-210777-0068.wav\"\n\n# Download a small audio file if it doesn't exist\nif not temp_audio_path.exists():\n    print(f\"Downloading sample audio from {wav_url}...\")\n    try:\n        # A timeout keeps the script from hanging on a stalled connection\n        response = requests.get(wav_url, stream=True, timeout=30)\n        response.raise_for_status()\n        with open(temp_audio_path, 'wb') as f:\n            for chunk in response.iter_content(chunk_size=8192):\n                f.write(chunk)\n        print(\"Sample audio downloaded.\")\n    except requests.exceptions.RequestException as e:\n        print(f\"Failed to download sample audio: {e}\")\n        print(\"Please ensure you have an internet connection or manually place a WAV file at temp_whisperx_example.wav\")\n        raise SystemExit(1)\n\n# --- WhisperX Core Logic ---\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n# float16 for GPU, int8 for CPU or low VRAM GPU\ncompute_type = 
\"float16\" if device == \"cuda\" else \"int8\" \nbatch_size = 16 # Reduce if low on GPU VRAM\n\nprint(f\"\\nLoading WhisperX model ('base') on {device} with {compute_type} precision...\")\n# Using 'base' for faster download and less VRAM for quickstart\nmodel = whisperx.load_model(\"base\", device, compute_type=compute_type, language=\"en\")\n\nprint(f\"Loading audio from {temp_audio_path}...\")\naudio = whisperx.load_audio(str(temp_audio_path))\n\nprint(\"Transcribing audio...\")\nresult = model.transcribe(audio, batch_size=batch_size)\n\nprint(\"Loading alignment model and aligning segments...\")\nmodel_a, metadata = whisperx.load_align_model(language_code=result[\"language\"], device=device)\naligned_result = whisperx.align(result[\"segments\"], model_a, metadata, audio, device)\n\nprint(\"\\nTranscription Result (Aligned):\")\nfor segment in aligned_result[\"segments\"]:\n    print(f\"[{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}\")\n\n# Optional: Add diarization for speaker assignment\n# Diarization models from Hugging Face may require an auth token.\n# Set your Hugging Face token as an environment variable (e.g., HF_TOKEN=\"hf_xxxx\")\nhf_token = os.environ.get(\"HF_TOKEN\", \"\") \n\nif hf_token:\n    print(\"\\nPerforming diarization (speaker assignment)...\")\n    # Diarization requires an internet connection to download models\n    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)\n    diarize_segments = diarize_model(str(temp_audio_path), min_speakers=1, max_speakers=2)\n    result_with_speakers = whisperx.assign_speakers(diarize_segments, aligned_result)\n    \n    print(\"\\nTranscription with Speakers:\")\n    for segment in result_with_speakers[\"segments\"]:\n        print(f\"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment.get('speaker', 'UNKNOWN')}: {segment['text']}\")\nelse:\n    print(\"\\nSkipping diarization: HF_TOKEN environment variable not found. 
Diarization models may require it.\")\n\n# --- Cleanup ---\ntemp_audio_path.unlink()\nprint(\"\\nWhisperX quickstart completed.\")","lang":"python","description":"This quickstart demonstrates loading an ASR model, transcribing a sample audio file (downloaded automatically for runnability), and aligning the transcription. It also includes an optional step for speaker diarization, which requires a Hugging Face authentication token for model downloads. Ensure `requests` is installed (`pip install requests`) to run this example, and `ffmpeg` is installed on your system."},"warnings":[{"fix":"Install ffmpeg on your operating system (e.g., `sudo apt install ffmpeg` on Debian/Ubuntu, `brew install ffmpeg` on macOS, or download from ffmpeg.org for Windows).","message":"FFmpeg is a critical system dependency for WhisperX's audio processing. It must be installed separately (e.g., via apt, brew, or by downloading binaries) and available in your system's PATH. WhisperX will not function without it.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Follow PyTorch's installation instructions for your specific CUDA version (e.g., `pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121`). Then, set `device='cuda'` and `compute_type='float16'` in WhisperX.","message":"GPU (CUDA) setup with PyTorch can be complex. Ensure you install the correct `torch` version matching your CUDA toolkit version for optimal performance. Incorrect installation often leads to 'CUDA not available' errors or silently falling back to CPU.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Obtain a Hugging Face API token from huggingface.co/settings/tokens. 
Set it as an environment variable (e.g., `HF_TOKEN=hf_YOUR_TOKEN_HERE`) or pass it directly to `DiarizationPipeline(use_auth_token='hf_YOUR_TOKEN_HERE', ...)`. If the diarization model is gated, also accept its user conditions on its Hugging Face model page.","message":"Speaker diarization models hosted on Hugging Face often require an authentication token to download. Without one, `DiarizationPipeline` may fail with an error about missing credentials.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Reduce `batch_size`, use a smaller model (e.g., 'base' or 'small'), or set `compute_type='int8'` to reduce the memory footprint. On GPU, prefer `compute_type='float16'` and fall back to `int8` when VRAM is tight.","message":"WhisperX models, especially 'large-v2', can consume significant GPU VRAM and system RAM. Running large models on devices with insufficient memory results in Out-of-Memory (OOM) errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the latest WhisperX GitHub README or documentation for the current `load_model` and API signatures. Update your code to match the new argument names and types. Ensure you are on Python >=3.10 and <3.14.","message":"The `whisperx.load_model` signature has changed across major versions. Older versions might not accept `compute_type`, or might expect different arguments for device configuration. The internal implementation of some functions may also have changed, which can affect older custom workflows.","severity":"breaking","affected_versions":"< 3.0"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}