{"id":2207,"library":"pyannote-audio","title":"pyannote.audio","description":"pyannote.audio is a state-of-the-art open-source toolkit for speaker diarization. It provides pre-trained deep learning models and pipelines for tasks like speaker recognition, voice activity detection, and speaker change detection. Currently at version 4.0.4, it actively integrates with the Hugging Face Hub for model distribution and offers robust audio processing capabilities. Releases are frequent for bug fixes and minor improvements, with major versions aligning with significant API or model architecture updates.","status":"active","version":"4.0.4","language":"en","source_language":"en","source_url":"https://github.com/pyannote/pyannote-audio","tags":["audio processing","speaker diarization","speech processing","machine learning","Hugging Face","deep learning"],"install":[{"cmd":"pip install pyannote.audio","lang":"bash","label":"Basic Installation"},{"cmd":"pip install pyannote.audio[onnx]","lang":"bash","label":"CPU-Optimized (ONNX Runtime)"},{"cmd":"pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\npip install pyannote.audio","lang":"bash","label":"GPU Installation (CUDA 12.1 example)"}],"dependencies":[{"reason":"Deep learning framework for model inference.","package":"torch","optional":false},{"reason":"Required for downloading and authenticating pre-trained models from Hugging Face Hub.","package":"huggingface_hub","optional":false},{"reason":"Optional backend for faster CPU inference.","package":"onnxruntime","optional":true}],"imports":[{"note":"The primary Pipeline class is directly available at the top-level `pyannote.audio` package since v2.x. 
Accessing it from internal modules like `pyannote.audio.core.pipeline` is discouraged and might break in future versions.","wrong":"from pyannote.audio.core.pipeline import Pipeline","symbol":"Pipeline","correct":"from pyannote.audio import Pipeline"},{"note":"For loading individual models directly, e.g., for embedding or Voice Activity Detection.","symbol":"Model","correct":"from pyannote.audio import Model"},{"note":"The core data structure for storing diarization results is part of the `pyannote.core` library.","symbol":"Annotation","correct":"from pyannote.core import Annotation"},{"note":"Used for representing time segments in annotations and other audio processing tasks, also part of `pyannote.core`.","symbol":"Segment","correct":"from pyannote.core import Segment"}],"quickstart":{"code":"import os\nimport torchaudio\nimport torch\nimport numpy as np\nimport tempfile\nimport shutil\n\n# 1. Create a dummy audio file for demonstration\nduration_seconds = 5\nsample_rate = 16000\nt = np.linspace(0, duration_seconds, int(sample_rate * duration_seconds), endpoint=False)\naudio_data = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)\n\ntemp_dir = tempfile.mkdtemp()\ndummy_audio_path = os.path.join(temp_dir, \"dummy_audio.wav\")\ntorchaudio.save(dummy_audio_path, torch.from_numpy(audio_data).unsqueeze(0), sample_rate)\n\n# 2. 
Authenticate with Hugging Face Hub\n# Get your Hugging Face token from https://huggingface.co/settings/tokens\n# and set it as an environment variable `HF_TOKEN` or replace the placeholder.\nhf_token = os.environ.get(\"HF_TOKEN\", \"hf_YOUR_HUGGING_FACE_TOKEN_HERE\")\n\nif hf_token == \"hf_YOUR_HUGGING_FACE_TOKEN_HERE\":\n    print(\"WARNING: Please obtain a Hugging Face token from https://huggingface.co/settings/tokens\")\n    print(\"and set the HF_TOKEN environment variable or replace the placeholder in the code.\")\n    print(\"Continuing with placeholder token; pipeline initialization might fail without proper authentication.\")\n\n# 3. Import and initialize the pyannote.audio pipeline\n# Pipelines are loaded with the `from_pretrained` class method, not the constructor.\nfrom pyannote.audio import Pipeline\npipeline = Pipeline.from_pretrained(\"pyannote/speaker-diarization-3.1\", use_auth_token=hf_token)\n\n# 4. Prepare the audio input\ndemo_file = {\"uri\": \"dummy_conversation\", \"audio\": dummy_audio_path}\n\n# 5. Run the speaker diarization\ndiarization = pipeline(demo_file)\n\n# 6. Print the diarization result\n# (a synthetic sine tone contains no speech, so this loop may print nothing)\nprint(\"\\nDiarization Result:\")\nfor turn, _, speaker in diarization.itertracks(yield_label=True):\n    print(f\"start={turn.start:.1f}s stop={turn.end:.1f}s speaker={speaker}\")\n\n# 7. Clean up the temporary directory holding the dummy audio file\nshutil.rmtree(temp_dir)\nprint(f\"\\nCleaned up temporary audio directory: {temp_dir}\")","lang":"python","description":"This quickstart demonstrates how to set up `pyannote.audio`, authenticate with the Hugging Face Hub, and run a speaker diarization pipeline on a dummy audio file. It highlights the critical step of providing an authentication token for model access, a common requirement for `pyannote.audio` models since version 4.0."},"warnings":[{"fix":"Obtain a Hugging Face user access token (read role is sufficient) from `https://huggingface.co/settings/tokens`. 
Pass it via the `use_auth_token` argument to `Pipeline.from_pretrained` or `Model.from_pretrained`, or log in using `huggingface-cli login`.","message":"As of `pyannote.audio` v4.x, all pre-trained models hosted on the Hugging Face Hub require an authentication token to be downloaded. This is a significant change from v3.x, where models could be downloaded without explicit authentication.","severity":"breaking","affected_versions":">=4.0.0"},{"fix":"Manually install the correct `torch` version for your CUDA toolkit (e.g., `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121` for CUDA 12.1) *before* installing `pyannote.audio`.","message":"GPU acceleration with PyTorch requires a specific `torch` installation matching your CUDA version. `pyannote.audio` itself does not install GPU-enabled `torch` by default, leading to CPU-only inference if not correctly set up.","severity":"gotcha","affected_versions":"All"},{"fix":"Pre-process audio to mono, 16kHz before passing it to the pipeline. `pyannote.audio` uses `torchaudio` for loading, which handles resampling and channel reduction internally, but explicit pre-processing ensures consistency.","message":"Input audio files should ideally be mono, 16kHz sample rate, and in a commonly supported format (e.g., WAV). Issues may arise with uncommon codecs, multichannel audio, or significantly different sample rates, potentially leading to errors or suboptimal performance.","severity":"gotcha","affected_versions":"All"},{"fix":"Always pin to a specific model version (e.g., `\"pyannote/speaker-diarization-3.1\"`) when instantiating pipelines in production or for reproducible research. Check the Hugging Face model page for available versions.","message":"Model versions (e.g., `pyannote/speaker-diarization-3.1` vs `pyannote/speaker-diarization@main`) can have different performance characteristics, bug fixes, or even breaking changes. 
Relying on `@main` can lead to unexpected behavior.","severity":"gotcha","affected_versions":"All"},{"fix":"Install `pyannote.audio` with the ONNX extra: `pip install pyannote.audio[onnx]`. The pipeline will automatically try to use ONNX Runtime if it is available and compatible.","message":"For CPU-only inference, the default PyTorch backend can be significantly slower than optimized runtimes like ONNX Runtime. This impacts processing time for long audio files or batch processing.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}