pyctcdecode
pyctcdecode is a Python library that provides a standalone beam search decoder for CTC (Connectionist Temporal Classification) models. It decodes CTC output efficiently and integrates with KenLM language models to improve speech recognition accuracy. The current version is 0.5.0, and the project is actively maintained with regular feature, performance, and bug-fix releases.
Common errors
- ModuleNotFoundError: No module named 'kenlm'
  - cause: The `pyctcdecode[kenlm]` extra was not installed, or its compilation failed, so the `kenlm` module required for language model functionality cannot be imported.
  - fix: Install with KenLM support using `pip install pyctcdecode[kenlm]`. If this fails, check the system dependencies for KenLM (Boost, Zlib, Bzip2, CMake) listed under Warnings.
- error: Boost library not found or not configured. Set BOOST_ROOT or BOOST_INCLUDEDIR to point to the Boost install.
  - cause: During `pyctcdecode[kenlm]` installation, the underlying KenLM C++ build could not locate the Boost library, a critical dependency.
  - fix: Install the Boost development headers (e.g., `sudo apt-get install libboost-all-dev` on Debian/Ubuntu, or `brew install boost` on macOS) and ensure `cmake` is installed. If Boost lives in a non-standard location, set `BOOST_ROOT` or `BOOST_INCLUDEDIR`.
- IndexError: list index out of range
  - cause: The alphabet passed to the decoder does not match the CTC model's output dimension, or the blank token is not the first element of the label list.
  - fix: Verify that the label list used to build the alphabet has the blank token (an empty string) as its first element, and that its total length exactly matches the last dimension of your CTC model's output logits.
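The alphabet/logits mismatch behind the `IndexError` above can be caught early with explicit checks before decoding. A minimal sketch (the `labels` and dummy `logits` here are illustrative, not from a real model):

```python
import numpy as np

# Illustrative label list: blank (empty string) first, then the characters.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")

# Stand-in for real model output with shape (time_steps, alphabet_size).
logits = np.zeros((50, len(labels)), dtype=np.float32)

# Fail fast with clear messages instead of an opaque IndexError later.
assert labels[0] == "", "blank token must be the first label"
assert logits.shape[-1] == len(labels), (
    f"alphabet size {len(labels)} != logits dim {logits.shape[-1]}"
)
```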
Warnings
- gotcha The KenLM C++ library, an optional but highly recommended dependency for language model integration, can be challenging to install due to its native compilation requirements (e.g., Boost, Zlib, Bzip2, CMake).
- gotcha The CTC blank token MUST be the first element in the label list used to build the decoder. Incorrect positioning will lead to incorrect decoding results or runtime errors.
- gotcha Loading a KenLM language model requires a valid `.arpa` (or binary `.bin`) file. Incorrect file paths, malformed model files, or extremely large language models can cause `FileNotFoundError`, `MemoryError`, or excessively slow initialization.
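When wiring in a language model, it can help to validate the model path up front rather than letting initialization fail deep inside KenLM. A minimal sketch (the path below is a placeholder, not a real file):

```python
import os

# Hypothetical placeholder path; substitute your actual KenLM model file.
kenlm_path = "path/to/lm.arpa"

if not os.path.isfile(kenlm_path):
    # Fall back to decoding without a language model rather than crashing.
    print(f"KenLM model not found: {kenlm_path}; decoding without LM")
    kenlm_path = None
```

With `kenlm_path` validated (or set to `None`), it can then be passed along when building the decoder.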
Install
- pip install pyctcdecode
- pip install pyctcdecode[kenlm]
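To confirm whether the optional KenLM extra actually installed, one can probe for the `kenlm` module at runtime; a small sketch:

```python
import importlib.util

# find_spec returns None when the kenlm bindings are not importable.
has_kenlm = importlib.util.find_spec("kenlm") is not None
print("kenlm available:", has_kenlm)
```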
Imports
- BeamSearchDecoderCTC
from pyctcdecode import BeamSearchDecoderCTC
- LanguageModel
from pyctcdecode import LanguageModel
- Alphabet
from pyctcdecode.alphabet import Alphabet
- build_ctcdecoder
from pyctcdecode import build_ctcdecoder
Quickstart
import numpy as np
from pyctcdecode import build_ctcdecoder
# Define the model's output vocabulary. In pyctcdecode the CTC blank is the
# empty string "" and should be the first element (index 0 of the logits).
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")
# build_ctcdecoder returns a BeamSearchDecoderCTC. Pass kenlm_model_path=...
# here to add a KenLM language model for better accuracy (see Warnings).
decoder = build_ctcdecoder(labels)
# Create dummy CTC output for demonstration; in a real scenario these
# would come from your deep learning model.
# Shape: (time_steps, alphabet_size), as per-frame log-probabilities.
time_steps = 50
rng = np.random.default_rng(0)
probs = rng.random((time_steps, len(labels))).astype(np.float32)
logits = np.log(probs / probs.sum(axis=1, keepdims=True))
# decode() returns the single best transcript as a string; use
# decode_beams() to inspect multiple hypotheses.
decoded_text = decoder.decode(logits)
print(f"Decoded text (example): {decoded_text}")
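For intuition about what the beam search decoder improves on, here is a minimal greedy (best-path) CTC collapse: take the argmax symbol per frame, merge repeats, and drop blanks. The tiny `labels` and `frames` values are illustrative, not tied to any real model:

```python
import numpy as np

def greedy_ctc_decode(logits, labels):
    """Best-path CTC decode: argmax per frame, merge repeats, drop blanks."""
    best = logits.argmax(axis=-1)  # most likely symbol index per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and labels[idx] != "":  # merge repeats, skip blank
            out.append(labels[idx])
        prev = idx
    return "".join(out)

labels = ["", "a", "b", "c"]  # blank first, as pyctcdecode expects
frames = np.array([
    [0.1, 0.8, 0.05, 0.05],   # 'a'
    [0.1, 0.8, 0.05, 0.05],   # 'a' again (repeat, merged away)
    [0.9, 0.05, 0.03, 0.02],  # blank (acts as separator)
    [0.1, 0.05, 0.8, 0.05],   # 'b'
])
print(greedy_ctc_decode(frames, labels))  # → "ab"
```

Beam search keeps multiple partial transcripts alive per frame instead of committing to the single argmax, which is why it recovers hypotheses that greedy decoding misses.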