WeTextProcessing
WeTextProcessing is an active Python library providing production-ready Text Normalization (TN) and Inverse Text Normalization (ITN) capabilities. It primarily supports Chinese, English, and Japanese languages, leveraging Finite State Transducers (FSTs) for efficient processing. The library has a consistent release cadence, with multiple minor updates released throughout 2024 to introduce new features, improvements, and bug fixes.
Common errors
- ModuleNotFoundError: No module named 'tn'
Cause: Importing `Normalizer` or `InverseNormalizer` from a top-level `wetextprocessing` package instead of the language-specific `tn.<lang>` or `itn.<lang>` submodules.
Fix: Use explicit, language-specific imports, e.g. `from tn.chinese.normalizer import Normalizer` for Chinese Text Normalization, or `from tn.english.normalizer import Normalizer as EnNormalizer` for English Text Normalization.
- Failed to build wheel for pynini / ERROR: Could not build wheels for pynini which use PEP 517 and cannot be installed directly
Cause: The `pynini` dependency requires compilation and is primarily supported on Linux and macOS. This error typically occurs on Windows or other unsupported platforms during `pip install WeTextProcessing`.
Fix: Install WeTextProcessing in a Linux environment (e.g., WSL on Windows), or install a pre-compiled `pynini` wheel compatible with your system and Python version before installing WeTextProcessing. For Conda users, `conda install -c conda-forge pynini` is often recommended.
- TypeError: Normalizer() got an unexpected keyword argument 'remove_erhua'
Cause: Passing a language-specific parameter (like `remove_erhua` for Chinese) to a `Normalizer` that does not support it (e.g., the English normalizer, or the normalizer from the separate `wetext` package).
Fix: Use the matching language-specific normalizer; `remove_erhua` requires `from tn.chinese.normalizer import Normalizer`. The English normalizer exposes different (and fewer) options; consult the documentation for each language's available parameters.
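The import layout behind the first error can be checked with a minimal sketch. The `try` guards are an assumption added here so the snippet runs safely even in an environment where WeTextProcessing is not installed:

```python
# WeTextProcessing installs importable packages named `tn` and `itn`;
# there is no top-level `wetextprocessing` module to import from.
try:
    from wetextprocessing import Normalizer  # wrong: no such module
    top_level_import_works = True
except ImportError:
    top_level_import_works = False

try:
    # Correct, language-specific imports:
    from tn.chinese.normalizer import Normalizer as ZhNormalizer
    from tn.english.normalizer import Normalizer as EnNormalizer
    package_installed = True
except ImportError:
    package_installed = False  # WeTextProcessing not installed here

print(top_level_import_works, package_installed)
```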
Warnings
- breaking Version 1.0.0 introduced significant changes to English Text Normalization rules, simplifying them compared to NeMo. While resulting in smaller FST sizes and faster build times, existing English TN implementations might require review and adjustment.
- gotcha The `pynini` dependency, fundamental to WeTextProcessing, is primarily designed for Linux and macOS environments. Direct installation on Windows is not straightforward and often fails. There is a separate `wetext` package that does not depend on `pynini`, but `WeTextProcessing` itself requires it.
- gotcha If you modify any parameters when initializing a `Normalizer` or `InverseNormalizer` (e.g., `remove_erhua`, `enable_0_to_9`), you must set `overwrite_cache=True` for the changes to take effect and for the underlying FSTs to be rebuilt. Failing to do so will result in the model reusing cached rules, ignoring your parameter changes.
- gotcha Starting from version 1.0.1, the global logging configuration was disabled within the library to prevent it from overwriting logging levels of other programs in the same environment. If your application relies on WeTextProcessing configuring logging globally, this behavior has changed.
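The `overwrite_cache` gotcha above can be sketched as follows. This assumes WeTextProcessing is installed; the `except ImportError` fallback is added here so the snippet degrades gracefully otherwise:

```python
# Sketch of the overwrite_cache gotcha (assumes WeTextProcessing is installed).
try:
    from tn.chinese.normalizer import Normalizer as ZhNormalizer

    # A plain ZhNormalizer() builds and caches the default FST on first use.
    # When a rule parameter such as remove_erhua changes, pass
    # overwrite_cache=True so the FST is rebuilt instead of silently reused.
    model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)
    result = model.normalize("全新版本儿")
except ImportError:
    result = None  # WeTextProcessing not available in this environment

print(result)
```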
Install
- pip install WeTextProcessing
Imports
- Normalizer (Chinese TN)
from tn.chinese.normalizer import Normalizer
- InverseNormalizer (Chinese ITN)
from itn.chinese.inverse_normalizer import InverseNormalizer
- Normalizer (English TN)
from tn.english.normalizer import Normalizer as EnNormalizer
Quickstart
from tn.chinese.normalizer import Normalizer as ZhNormalizer
from itn.chinese.inverse_normalizer import InverseNormalizer
from tn.english.normalizer import Normalizer as EnNormalizer
# Chinese Text Normalization with erhua removal
zh_tn_model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)
zh_tn_text = "你好WeTextProcessing 1.0,全新版本儿,简直666"
print(f"Chinese TN: {zh_tn_text} => {zh_tn_model.normalize(zh_tn_text)}")
# Chinese Inverse Text Normalization
zh_itn_model = InverseNormalizer(enable_0_to_9=False, overwrite_cache=True)
zh_itn_text = "你好WeTextProcessing 一点零,全新版本儿,简直六六六"
print(f"Chinese ITN: {zh_itn_text} => {zh_itn_model.normalize(zh_itn_text)}")
# English Text Normalization
en_tn_model = EnNormalizer(overwrite_cache=True)
en_tn_text = "Hello WeTextProcessing 1.0, life is short, just use wetext, 666, 9 and 10"
print(f"English TN: {en_tn_text} => {en_tn_model.normalize(en_tn_text)}")