pyannote.audio
pyannote.audio is a state-of-the-art open-source toolkit for speaker diarization. It provides pretrained deep learning models and pipelines for tasks such as speaker recognition, voice activity detection, and speaker change detection. Currently at version 4.0.4, it integrates with the Hugging Face Hub for model distribution and offers robust audio processing capabilities. Releases are frequent for bug fixes and minor improvements, with major versions aligning with significant API or model architecture changes.
Warnings
- breaking As of `pyannote.audio` v4.x, all pre-trained models hosted on the Hugging Face Hub require an authentication token to be downloaded. This is a significant change from v3.x, where models could be downloaded without explicit authentication.
- gotcha GPU acceleration with PyTorch requires a specific `torch` installation matching your CUDA version. `pyannote.audio` itself does not install GPU-enabled `torch` by default, leading to CPU-only inference if not correctly set up.
- gotcha Input audio files should ideally be mono, 16kHz sample rate, and in a commonly supported format (e.g., WAV). Issues may arise with uncommon codecs, multichannel audio, or significantly different sample rates, potentially leading to errors or suboptimal performance.
- gotcha Model versions (e.g., `pyannote/speaker-diarization-3.1` vs `pyannote/speaker-diarization@main`) can have different performance characteristics, bug fixes, or even breaking changes. Relying on `@main` can lead to unexpected behavior.
- gotcha For CPU-only inference, the default PyTorch backend can be significantly slower than optimized runtimes like ONNX. This impacts processing time for long audio files or batch processing.
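To illustrate the mono/16 kHz expectation above, here is a minimal NumPy sketch that downmixes stereo to mono and resamples by linear interpolation. This is a crude stand-in for a proper resampler (in practice, prefer `torchaudio.functional.resample` or ffmpeg); the `(channels, samples)` layout and the helper name are assumptions for illustration.

```python
import numpy as np

def to_mono_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix (channels, samples) audio to mono and resample to target_sr.

    Linear interpolation is only a rough approximation of a real resampler.
    """
    if audio.ndim == 2:          # (channels, samples) -> average the channels
        audio = audio.mean(axis=0)
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    duration = audio.shape[0] / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio).astype(np.float32)

stereo = np.random.randn(2, 44100).astype(np.float32)  # 1 s of 44.1 kHz stereo
mono = to_mono_16k(stereo, orig_sr=44100)
print(mono.shape)  # (16000,)
```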
Install
- pip install pyannote.audio
- pip install pyannote.audio[onnx]
- GPU (CUDA 12.1): pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 && pip install pyannote.audio
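After installing, it is worth sanity-checking whether a CUDA-enabled torch is actually available, since a CPU-only wheel silently falls back to slow inference. A hedged sketch that degrades to "cpu" when torch is missing or built without CUDA:

```python
import importlib.util

# Pick the device inference should run on; falls back to CPU when
# torch is absent or was installed without CUDA support.
if importlib.util.find_spec("torch") is not None:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
else:
    device = "cpu"
print(f"inference device: {device}")
```

In recent pyannote.audio releases, a loaded pipeline can then be moved to the GPU with `pipeline.to(torch.device(device))`.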
Imports
- Pipeline
from pyannote.audio import Pipeline
- Model
from pyannote.audio import Model
- Annotation
from pyannote.core import Annotation
- Segment
from pyannote.core import Segment
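To show what `Segment` models without requiring pyannote.core at hand, here is a plain-Python stand-in (a `[start, end)` time span in seconds, with duration and overlap checks). The class name and methods here are illustrative; the real `pyannote.core.Segment` offers the same ideas plus set operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniSegment:
    """Plain-Python stand-in for pyannote.core.Segment: a [start, end) time span."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start

    def overlaps(self, other: "MiniSegment") -> bool:
        # Two spans overlap when neither ends before the other starts.
        return self.start < other.end and other.start < self.end

a = MiniSegment(0.0, 2.5)
b = MiniSegment(2.0, 4.0)
print(a.duration)     # 2.5
print(a.overlaps(b))  # True
```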
Quickstart
import os
import torchaudio
import torch
import numpy as np
import tempfile
import shutil
# 1. Create a dummy audio file for demonstration (a pure sine tone; real speech
#    is needed for meaningful diarization output)
duration_seconds = 5
sample_rate = 16000
t = np.linspace(0, duration_seconds, int(sample_rate * duration_seconds), endpoint=False)
audio_data = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)
temp_dir = tempfile.mkdtemp()
dummy_audio_path = os.path.join(temp_dir, "dummy_audio.wav")
torchaudio.save(dummy_audio_path, torch.from_numpy(audio_data).unsqueeze(0), sample_rate)
# 2. Authenticate with Hugging Face Hub
# Get your Hugging Face token from https://huggingface.co/settings/tokens
# and set it as an environment variable `HF_TOKEN` or replace the placeholder.
hf_token = os.environ.get("HF_TOKEN", "hf_YOUR_HUGGING_FACE_TOKEN_HERE")
if hf_token == "hf_YOUR_HUGGING_FACE_TOKEN_HERE":
    print("WARNING: Please obtain a Hugging Face token from https://huggingface.co/settings/tokens")
    print("and set the HF_TOKEN environment variable or replace the placeholder in the code.")
    print("Continuing with placeholder token; pipeline initialization might fail without proper authentication.")
# 3. Import and initialize the Pyannote.audio Pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
# 4. Prepare the audio input
demo_file = {"uri": "dummy_conversation", "audio": dummy_audio_path}
# 5. Run the speaker diarization
di_result = pipeline(demo_file)
# 6. Print the diarization result
print("\nDiarization Result:")
for turn, _, speaker in di_result.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker={speaker}")
# 7. Clean up the dummy audio file
shutil.rmtree(temp_dir)
print(f"\nCleaned up temporary audio directory: {temp_dir}")
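`itertracks(yield_label=True)` yields (segment, track, label) triples, so a common post-processing step is totaling speaking time per speaker. A sketch over plain (start, end, speaker) tuples, which you might collect as `[(t.start, t.end, spk) for t, _, spk in di_result.itertracks(yield_label=True)]` (the speaker labels and times below are made up):

```python
from collections import defaultdict

# Hypothetical turns as (start, end, speaker) tuples
turns = [(0.0, 2.5, "SPEAKER_00"), (2.5, 4.0, "SPEAKER_01"), (4.0, 7.0, "SPEAKER_00")]

totals = defaultdict(float)
for start, end, speaker in turns:
    totals[speaker] += end - start

for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f}s")
# SPEAKER_00: 5.5s
# SPEAKER_01: 1.5s
```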