WeTextProcessing

1.0.4.1 · active · verified Thu Apr 16

WeTextProcessing is an actively maintained Python library for production-ready Text Normalization (TN) and Inverse Text Normalization (ITN). It primarily supports Chinese, English, and Japanese, and uses Finite State Transducers (FSTs) for efficient processing. Multiple minor updates were released throughout 2024, adding new features, improvements, and bug fixes.
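To make the TN/ITN distinction concrete, here is a toy sketch in plain Python. It is not WeTextProcessing's API and does not use FSTs. The names `toy_tn` and `toy_itn` and the lookup table are hypothetical, chosen only to show the direction of each transformation: TN maps written forms to spoken forms, and ITN maps them back.

```python
# Toy illustration only: a digit lookup table, not an FST and not the library's API.
DIGIT_TO_WORD = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
WORD_TO_DIGIT = {word: digit for digit, word in DIGIT_TO_WORD.items()}


def toy_tn(text: str) -> str:
    """Written form -> spoken form: spell out each ASCII digit of numeric tokens."""
    out = []
    for token in text.split():
        if token.isascii() and token.isdigit():
            out.extend(DIGIT_TO_WORD[ch] for ch in token)
        else:
            out.append(token)
    return " ".join(out)


def toy_itn(text: str) -> str:
    """Spoken form -> written form: merge each run of digit words into one number."""
    out, run = [], []
    for token in text.split():
        if token in WORD_TO_DIGIT:
            run.append(WORD_TO_DIGIT[token])
            continue
        if run:  # a non-digit word ends the current run
            out.append("".join(run))
            run = []
        out.append(token)
    if run:  # flush a run that ends the sentence
        out.append("".join(run))
    return " ".join(out)


print(toy_tn("call 911 now"))
print(toy_itn("call nine one one now"))
```

The round trip is not exact in general: `toy_itn` would merge the spoken form of "1 2" into "12". Ambiguities of this kind are one reason real systems rely on carefully written grammars rather than simple lookups.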

Quickstart

This quickstart performs Chinese Text Normalization (TN), Chinese Inverse Text Normalization (ITN), and English Text Normalization with the `WeTextProcessing` library. Each language and direction has its own import path. The example passes `overwrite_cache=True` when normalizer parameters are changed, so that the rules are rebuilt instead of being loaded from a previously cached build.

from tn.chinese.normalizer import Normalizer as ZhNormalizer
from itn.chinese.inverse_normalizer import InverseNormalizer
from tn.english.normalizer import Normalizer as EnNormalizer

# Chinese Text Normalization with erhua removal
zh_tn_model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)
zh_tn_text = "你好WeTextProcessing 1.0,全新版本儿,简直666"
print(f"Chinese TN: {zh_tn_text} => {zh_tn_model.normalize(zh_tn_text)}")

# Chinese Inverse Text Normalization
zh_itn_model = InverseNormalizer(enable_0_to_9=False, overwrite_cache=True)
zh_itn_text = "你好WeTextProcessing 一点零,全新版本儿,简直六六六"
print(f"Chinese ITN: {zh_itn_text} => {zh_itn_model.normalize(zh_itn_text)}")

# English Text Normalization
en_tn_model = EnNormalizer(overwrite_cache=True)
en_tn_text = "Hello WeTextProcessing 1.0, life is short, just use wetext, 666, 9 and 10"
print(f"English TN: {en_tn_text} => {en_tn_model.normalize(en_tn_text)}")
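Because `overwrite_cache=True` causes the rules to be rebuilt, it is usually worth constructing each normalizer once and reusing it for many inputs. The sketch below shows one way to do that. `make_cached_loader`, `normalize_all`, and `StubNormalizer` are hypothetical names, not part of the library. The stub stands in for a real normalizer so the example runs without WeTextProcessing installed, and the only thing assumed about the real classes is the `normalize(text)` method used in the quickstart.

```python
from functools import lru_cache
from typing import Callable, Iterable, List


def make_cached_loader(factory: Callable[[], object]) -> Callable[[], object]:
    """Wrap a zero-argument factory so the model is built only on first use."""

    @lru_cache(maxsize=None)
    def load() -> object:
        return factory()

    return load


def normalize_all(load_model: Callable[[], object], texts: Iterable[str]) -> List[str]:
    """Normalize many strings with a single shared model instance."""
    model = load_model()
    return [model.normalize(text) for text in texts]


class StubNormalizer:
    """Hypothetical stand-in that counts how many times it is constructed."""

    builds = 0

    def __init__(self) -> None:
        StubNormalizer.builds += 1

    def normalize(self, text: str) -> str:
        return text.strip()


load_stub = make_cached_loader(StubNormalizer)
results = normalize_all(load_stub, [" a ", "b "]) + normalize_all(load_stub, ["c"])
print(results, StubNormalizer.builds)
```

In real use, the factory would be something like `lambda: ZhNormalizer(remove_erhua=True, overwrite_cache=True)` from the quickstart, so the rebuild happens on the first call only.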
