{"id":6946,"library":"whisper-normalizer","title":"Whisper Normalizer","description":"Whisper Normalizer (version 0.1.12) is a Python package that implements the text standardization and normalization approach used in OpenAI's Whisper ASR model. Normalizing reference and hypothesis transcripts before scoring keeps the evaluation of Automatic Speech Recognition (ASR) systems from incurring unintentional penalties in metrics such as WER and CER. Beyond the English and basic normalizers, the library includes specialized normalizers for Indic languages that address challenges such as diacritic preservation. The project appears to be under active development with regular releases.","status":"active","version":"0.1.12","language":"en","source_language":"en","source_url":"https://github.com/kurianbenoy/whisper_normalizer","tags":["text processing","nlp","whisper","normalization","ASR","indic languages"],"install":[{"cmd":"pip install whisper-normalizer","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Used for number normalization in Indic languages, replacing a previous implementation.","package":"indic-numtowords","optional":false},{"reason":"Provides core normalization logic for Indic languages.","package":"indic-nlp-library","optional":false}],"imports":[{"symbol":"EnglishTextNormalizer","correct":"from whisper_normalizer.english import EnglishTextNormalizer"},{"symbol":"BasicTextNormalizer","correct":"from whisper_normalizer.basic import BasicTextNormalizer"},{"note":"BasicTextNormalizer can remove crucial diacritics and change meaning in Indic languages. 
Use language-specific normalizers like MalayalamNormalizer for correct results.","wrong":"from whisper_normalizer.basic import BasicTextNormalizer","symbol":"MalayalamNormalizer","correct":"from whisper_normalizer.indic_normalizer import MalayalamNormalizer"}],"quickstart":{"code":"from whisper_normalizer.english import EnglishTextNormalizer\nfrom whisper_normalizer.basic import BasicTextNormalizer\nfrom whisper_normalizer.indic_normalizer import MalayalamNormalizer\n\n# English Text Normalization\nenglish_normalizer = EnglishTextNormalizer()\ntext_en = \"I'm a little teapot, short & stout. Tip me over and pour me out! $20 million.\"\nnormalized_en = english_normalizer(text_en)\nprint(f\"English (input): {text_en}\")\nprint(f\"English (output): {normalized_en}\\n\")\n\n# Basic Text Normalization (general purpose, but use with caution for Indic languages)\nbasic_normalizer = BasicTextNormalizer()\ntext_basic = \"Hello [music] world (coughs). café résumé naïve.\"\nnormalized_basic = basic_normalizer(text_basic)\nprint(f\"Basic (input): {text_basic}\")\nprint(f\"Basic (output): {normalized_basic}\\n\")\n\n# Malayalam Text Normalization (example for Indic languages)\nmalayalam_normalizer = MalayalamNormalizer()\ntext_ml = \"എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ.\"\nnormalized_ml = malayalam_normalizer(text_ml)\nprint(f\"Malayalam (input): {text_ml}\")\nprint(f\"Malayalam (output): {normalized_ml}\")","lang":"python","description":"Demonstrates the use of EnglishTextNormalizer, BasicTextNormalizer, and a specific Indic normalizer (MalayalamNormalizer) to process and normalize text. 
The three outputs show how the normalization rules differ between English, general, and Indic text."},"warnings":[{"fix":"For Indic languages, always import and use the matching language-specific normalizer from `whisper_normalizer.indic_normalizer` (e.g., `MalayalamNormalizer`, `PunjabiNormalizer`).","message":"Applying `BasicTextNormalizer` or `EnglishTextNormalizer` to Indic-language text can strip diacritics that carry meaning and so change what the text says. These normalizers target English and general text, not the characteristics of Indic scripts.","severity":"gotcha","affected_versions":"All versions"},{"fix":"When upgrading from a version prior to 0.1.4, re-check normalization outputs in your application, especially for number-heavy text, to confirm they still match what downstream code expects.","message":"Between versions 0.1.0 and 0.1.4, the library switched its number normalization backend for Indic languages to AI4Bharat's `indic-numtowords` package and removed network calls from `EnglishTextNormalizer`. Outputs may therefore differ from earlier versions, particularly for numbers, currencies, and certain English text patterns.","severity":"gotcha","affected_versions":"Prior to 0.1.4"},{"fix":"For standalone text normalization, import from `whisper_normalizer`. If you are already working with the `openai-whisper` package, use its own `whisper.normalizers` submodule.","message":"Do not confuse `whisper-normalizer` (this standalone library) with the `whisper.normalizers` submodule inside OpenAI's `openai-whisper` package. 
This library provides the normalizers on their own, while the `openai-whisper` versions ship inside that package and use a different import path (e.g., `from whisper.normalizers import BasicTextNormalizer`).","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}