SpeechBrain
SpeechBrain is an open-source, all-in-one speech toolkit built in pure Python and PyTorch. It supports research and development of neural speech processing systems and ships models for tasks such as automatic speech recognition (ASR), voice activity detection (VAD), speaker recognition, and speech enhancement. The current version is 1.1.0, with releases typically tied to research milestones and new model introductions.
Warnings
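- breaking SpeechBrain 1.0.0 introduced significant breaking changes in the training recipes, the data pipeline, and distributed training, and many modules were renamed or refactored. In particular, the pretrained-model interfaces moved from `speechbrain.pretrained` to `speechbrain.inference`. Code written for 0.5.x should be checked against the 1.0 release notes before upgrading.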
- gotcha Pretrained models downloaded via `from_hparams` create local directories (`savedir`) which can consume significant disk space (multiple GBs per model). These are not automatically cleaned up.
- gotcha Most pretrained SpeechBrain models expect a specific audio format, typically 16 kHz, single-channel (mono) audio. Passing audio with a different sample rate or multiple channels without resampling or downmixing can raise errors or silently degrade model performance.
- deprecated The `speechbrain.pretrained` module is the legacy location of the inference interfaces. Since SpeechBrain 1.0 they live in task-specific submodules of `speechbrain.inference`, such as `speechbrain.inference.ASR` and `speechbrain.inference.VAD`. The old `speechbrain.pretrained` paths are kept as deprecated redirects, so new code should import from `speechbrain.inference`.
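The disk-space gotcha above can be monitored with a small helper. The following is a minimal sketch using only the standard library; `dir_size_bytes` and `remove_if_larger_than` are hypothetical names introduced for illustration and are not part of SpeechBrain, and the demo directory is a stand-in for a real `savedir`.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted as local data
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def remove_if_larger_than(path, limit_bytes):
    """Delete path when its contents exceed limit_bytes; return True if deleted."""
    if os.path.isdir(path) and dir_size_bytes(path) > limit_bytes:
        shutil.rmtree(path, ignore_errors=True)
        return True
    return False


# Demonstration on a throwaway directory standing in for a model savedir
demo_dir = tempfile.mkdtemp(prefix="savedir_demo_")
with open(os.path.join(demo_dir, "model.ckpt"), "wb") as f:
    f.write(b"\0" * 2048)  # 2 KiB placeholder, not a real checkpoint

demo_size = dir_size_bytes(demo_dir)
demo_removed = remove_if_larger_than(demo_dir, limit_bytes=1024)
print(f"size={demo_size} bytes, removed={demo_removed}")
```

If the entries in a `savedir` turn out to be symlinks into a shared download cache, the space is held by that cache rather than by `savedir`, and the helper would need to be pointed at the cache directory instead.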
Install
- pip install speechbrain
- pip install speechbrain torchaudio
Imports
- EncoderDecoderASR
from speechbrain.inference.ASR import EncoderDecoderASR
- SpeakerRecognition
from speechbrain.inference.speaker import SpeakerRecognition
- VAD
from speechbrain.inference.VAD import VAD
- DynamicItemDataset
from speechbrain.dataio.dataset import DynamicItemDataset
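Because import paths differ between SpeechBrain releases, code that must run on more than one version may want to try several locations in turn. The sketch below shows one way to do that with `importlib`; `import_first_available` is a hypothetical helper, and the candidate list assumes the 1.0+ layout (`speechbrain.inference.ASR`) with the older `speechbrain.pretrained` path as a fallback, which should be verified against the installed version.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that imports cleanly from (module, attribute) pairs."""
    errors = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one
            errors.append(f"{module_name}.{attr_name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Candidate locations, newest first (assumed layout; verify against your version)
ASR_CANDIDATES = [
    ("speechbrain.inference.ASR", "EncoderDecoderASR"),
    ("speechbrain.pretrained", "EncoderDecoderASR"),
]

# Usage (requires speechbrain to be installed):
# EncoderDecoderASR = import_first_available(ASR_CANDIDATES)
```

Listing the newest path first means the deprecated location is only touched when the current one is missing.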
Quickstart
import os
import shutil

import torch
# Inference interfaces live in speechbrain.inference since SpeechBrain 1.0
from speechbrain.inference.ASR import EncoderDecoderASR

# Directory where the pretrained model files are stored
savedir = "tmpdir_asr_quickstart"

try:
    # Download (on first use) and load the pretrained ASR model
    asr_model = EncoderDecoderASR.from_hparams(
        source="speechbrain/asr-crdnn-rnnlm-librispeech",
        savedir=savedir,
    )

    # Create a dummy audio tensor of shape (batch_size, samples).
    # SpeechBrain models typically expect single-channel, 16 kHz audio.
    sample_rate = 16000
    duration_seconds = 3
    # Random noise standing in for a short clip; the output text is not meaningful
    dummy_audio = torch.randn(1, sample_rate * duration_seconds)
    # Relative length of each batch item (1.0 = full length, no padding)
    wav_lens = torch.tensor([1.0])

    # Perform ASR; transcribe_batch returns (words, tokens)
    predicted_words, predicted_tokens = asr_model.transcribe_batch(dummy_audio, wav_lens)
    print(f"Transcription: {predicted_words}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the downloaded model directory
    if os.path.exists(savedir):
        shutil.rmtree(savedir, ignore_errors=True)
        print(f"Cleaned up temporary directory: {savedir}")
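The quickstart feeds the model a synthetic tensor, so the format requirement is satisfied by construction. With real recordings it can help to verify the file before running inference. The sketch below uses the standard-library `wave` module to check that a PCM WAV file is 16 kHz mono; `check_wav_format` and `write_silent_wav` are hypothetical helpers for illustration, and the 16 kHz mono default is the typical expectation stated above rather than a guarantee for every model.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    return problems


def write_silent_wav(path, rate, channels, seconds=1):
    """Write a silent 16-bit PCM WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\0" * (rate * channels * 2 * seconds))


# Create one conforming and one non-conforming file, then check both
demo_dir = tempfile.mkdtemp(prefix="wav_check_")
good_path = os.path.join(demo_dir, "good.wav")
bad_path = os.path.join(demo_dir, "bad.wav")
write_silent_wav(good_path, rate=16000, channels=1)
write_silent_wav(bad_path, rate=44100, channels=2)

good_problems = check_wav_format(good_path)
bad_problems = check_wav_format(bad_path)
print(good_problems)
print(bad_problems)
```

The `wave` module only reads uncompressed PCM WAV files; other formats, and the actual resampling or downmixing, would need an audio library such as torchaudio.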