Whisper Normalizer

0.1.12 · active · verified Wed Apr 15

Whisper Normalizer (version 0.1.12) is a Python package that implements the text standardization and normalization approach used in OpenAI's Whisper ASR model. Normalizing reference and hypothesis transcripts before scoring matters when evaluating Automatic Speech Recognition (ASR) systems, because it keeps differences in casing, punctuation, and formatting from being counted as recognition errors in metrics such as WER and CER. Beyond English normalization, the library includes specialized normalizers for Indic languages that address problems such as diacritic preservation. The project appears to be actively developed and released.
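
To make the effect on WER concrete, the following is a minimal, standard-library-only sketch. It does not use the package itself: `toy_normalize` is a hypothetical stand-in that only lowercases and strips punctuation, and `word_error_rate` is a plain word-level edit distance written for this illustration. Both names and the sample sentences are assumptions.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


def toy_normalize(text: str) -> str:
    """Hypothetical stand-in normalizer (NOT the library's logic):
    lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


reference = "Hello, world! It's a test."
hypothesis = "hello world it's a test"

raw_wer = word_error_rate(reference, hypothesis)
norm_wer = word_error_rate(toy_normalize(reference), toy_normalize(hypothesis))
print(f"WER without normalization: {raw_wer:.2f}")
print(f"WER with normalization:    {norm_wer:.2f}")
```

Here four of the five reference words differ from the hypothesis only in casing or punctuation, so the unnormalized score is 0.80 even though every word was recognized correctly; after normalization it drops to 0.00. In a real evaluation the stand-in would presumably be replaced by one of the package's normalizers applied to both transcripts.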

Warnings

Install

Imports

Quickstart

The example below runs EnglishTextNormalizer, BasicTextNormalizer, and one Indic normalizer (MalayalamNormalizer) on sample text to show how the normalization strategy differs by language.

from whisper_normalizer.english import EnglishTextNormalizer
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.indic_normalizer import MalayalamNormalizer

# English Text Normalization
english_normalizer = EnglishTextNormalizer()
text_en = "I'm a little teapot, short & stout. Tip me over and pour me out! $20 million."
normalized_en = english_normalizer(text_en)
print(f"English (input): {text_en}")
print(f"English (output): {normalized_en}\n")

# Basic Text Normalization (general purpose, but use with caution for Indic languages)
basic_normalizer = BasicTextNormalizer()
text_basic = "Hello [music] world (coughs). café résumé naïve."
normalized_basic = basic_normalizer(text_basic)
print(f"Basic (input): {text_basic}")
print(f"Basic (output): {normalized_basic}\n")

# Malayalam Text Normalization (example for Indic languages)
malayalam_normalizer = MalayalamNormalizer()
text_ml = "എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ."
normalized_ml = malayalam_normalizer(text_ml)
print(f"Malayalam (input): {text_ml}")
print(f"Malayalam (output): {normalized_ml}")
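
The caution about Indic languages comes down to how Unicode classifies vowel signs. The sketch below, using only the standard library, shows that the vowel sign in a word from the Malayalam sample above is a combining mark, so a generic cleanup step that treats marks as removable symbols breaks the word apart. `strip_marks` is a hypothetical helper written for this illustration and is not taken from the package.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical generic cleanup: replace every Unicode mark
    (general categories Mn, Mc, Me) with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("M") else ch
        for ch in unicodedata.normalize("NFKC", text)
    )


word = "ഭാഷ"  # "language", taken from the Malayalam sample sentence

# Print the code point, general category, and name of each character.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

stripped = strip_marks(word)
print(repr(stripped))
```

The middle character, U+0D3E (MALAYALAM VOWEL SIGN AA), has category Mc, so the helper replaces it with a space and the single word becomes two separate consonants. A language-specific normalizer such as MalayalamNormalizer is meant to avoid this kind of damage.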
