Python Speech Features
python-speech-features is a Python library for extracting common speech features used in Automatic Speech Recognition (ASR). It computes Mel-Frequency Cepstral Coefficients (MFCCs), filterbank energies, log filterbank energies, and spectral subband centroids. The current stable version on PyPI is 0.6, released in 2017; the GitHub repository carries a slightly newer v0.6.1 tag from 2020. Releases are infrequent, but the core functions remain widely used for basic speech feature extraction.
Warnings
- gotcha The PyPI version (0.6, last updated Aug 2017) is older than the latest tag on GitHub (v0.6.1, Jan 2020). Users installing via `pip install python-speech-features` might not get the absolute latest code, which could have minor fixes or changes not yet reflected on PyPI.
- gotcha When integrating with other audio processing libraries such as `librosa`, check data type expectations. `scipy.io.wavfile.read` returns integer samples (`int16` for 16-bit PCM files), whereas `librosa` works with floating-point samples scaled to [-1, 1]. MFCC implementations also differ: both libraries compute a framed FFT, but they use different defaults for windowing, pre-emphasis, filterbank construction, and liftering, so seemingly identical parameters produce different values. Output orientation differs too: `python-speech-features` returns one row per frame, while `librosa.feature.mfcc` returns one column per frame.
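- gotcha The warning `WARNING:root:frame length (X) is greater than FFT size` appears when the frame length in samples (`winlen` in seconds multiplied by `samplerate`) exceeds `nfft`, the FFT size. For example, a 25 ms window at 44.1 kHz spans about 1103 samples, more than an `nfft` of 512; in that case each frame is truncated to `nfft` samples unless a larger `nfft` is passed.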
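The frame-length check behind that last warning can be reproduced before calling the library. The sketch below is a minimal illustration: `suggest_nfft` is a hypothetical helper, not part of `python-speech-features`, and it assumes that rounding up to the next power of two at or above the frame length is an acceptable FFT size for your use case.

```python
import math


def suggest_nfft(samplerate, winlen=0.025):
    """Return the smallest power of two that is >= the frame length in samples.

    Hypothetical helper for illustration; not part of python-speech-features.
    """
    # Frame length in samples; ceil() keeps the estimate conservative.
    frame_len = max(1, math.ceil(winlen * samplerate))
    # (n - 1).bit_length() is the exponent of the next power of two >= n.
    return 1 << (frame_len - 1).bit_length()


# At 16 kHz a 25 ms frame holds 400 samples, so 512 is large enough.
print(suggest_nfft(16000))
# At 44.1 kHz a 25 ms frame holds about 1103 samples, which exceeds 512.
print(suggest_nfft(44100))
```

The result can then be passed explicitly, for example `mfcc(sig, rate, nfft=suggest_nfft(rate))`, so that frames are not truncated.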
Install
pip install python-speech-features
Imports
- mfcc
from python_speech_features import mfcc
- fbank
from python_speech_features import fbank
- logfbank
from python_speech_features import logfbank
- ssc
from python_speech_features import ssc
Quickstart
import os

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
# Create a dummy WAV file for demonstration
samplerate = 16000 # Hz
duration = 1 # seconds
f_hz = 440 # A4 note
# endpoint=False keeps the sample spacing at exactly 1 / samplerate
t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * f_hz * t)
# Scale to 16-bit integer for WAV file
wav_signal = (signal * 32767).astype(np.int16)
dummy_wav_filename = 'dummy_audio.wav'
wavfile.write(dummy_wav_filename, samplerate, wav_signal)
# Read the audio file
(rate, sig) = wavfile.read(dummy_wav_filename)
# Compute MFCC features: one row per frame, one column per coefficient (13 by default)
mfcc_feat = mfcc(sig, rate)
print(f"MFCC features shape: {mfcc_feat.shape}")
# Compute log filterbank energies: one row per frame, one column per filter (26 by default)
fbank_feat = logfbank(sig, rate)
print(f"Log Filterbank features shape: {fbank_feat.shape}")
# Clean up the dummy file
os.remove(dummy_wav_filename)
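The shapes printed by the quickstart follow from how the signal is split into frames. The sketch below estimates the number of rows without running the library. `expected_num_frames` is a hypothetical helper, and it rests on assumptions: the default `winlen=0.025` and `winstep=0.01`, frame length and step rounded to the nearest sample, and a final partial frame that is zero-padded rather than dropped.

```python
import math


def expected_num_frames(num_samples, samplerate, winlen=0.025, winstep=0.01):
    """Estimate how many frames (rows) a feature matrix will have.

    Hypothetical helper for illustration; it encodes the assumptions stated
    above and is not part of python-speech-features.
    """
    # Round half up to the nearest whole sample.
    frame_len = int(math.floor(winlen * samplerate + 0.5))
    frame_step = int(math.floor(winstep * samplerate + 0.5))
    if num_samples <= frame_len:
        return 1
    # One full frame, plus one frame per step needed to cover the remainder.
    return 1 + math.ceil((num_samples - frame_len) / frame_step)


# One second of 16 kHz audio, as generated in the quickstart.
print(expected_num_frames(16000, 16000))
```

Under these assumptions the one-second quickstart signal yields 99 frames, so the MFCC matrix would be 99 x 13 and the log filterbank matrix 99 x 26 with the default `numcep=13` and `nfilt=26`.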