Zhon
Zhon is a Python library that provides constants commonly used in Chinese text processing, such as CJK characters, Chinese punctuation marks, and patterns for Pinyin and Zhuyin. The library is currently active, with version 2.1.1 released in November 2024, and generally follows an infrequent release cadence.
Common errors
-
ERROR: Package 'zhon' requires a different Python: ...
cause Attempting to install or run `zhon` version 2.1.1 or higher with a Python version older than 3.9.fixEnsure your Python environment is version 3.9 or newer. Use `python --version` to check. If needed, create a new virtual environment with a compatible Python version. -
AttributeError: module 'zhon' has no attribute 'characters'
cause Trying to access constants directly from the top-level `zhon` module instead of its submodules (e.g., `zhon.hanzi`, `zhon.pinyin`).fixConstants are organized into thematic submodules. Correctly import and access them, for example: `import zhon.hanzi` then `zhon.hanzi.characters`.
Warnings
- gotcha The `zhon.hanzi.characters` constant represents all Han ideographs, which includes characters used in Chinese, Japanese, and Korean. It does not strictly filter for 'Chinese only' characters and should not be solely relied upon for language identification.
- breaking Older versions of Zhon (prior to 2.x) supported Python 2.7. Version 2.1.1 and newer explicitly require Python >= 3.9. Attempting to use Zhon 2.x on older Python environments will result in installation or runtime errors.
- deprecated Earlier documentation and examples for regex pattern building sometimes used the `%s` operator for string formatting (e.g., `re.findall('[%s]' % zhon.hanzi.characters)`). While still functional in some cases, the f-string or `.format()` method is the modern and recommended approach in Python 3 for clarity and safety.
Install
-
pip install zhon
Imports
- hanzi
import zhon.hanzi
- pinyin
import zhon.pinyin
- zhuyin
import zhon.zhuyin
- cedict
import zhon.cedict
Quickstart
import re
import zhon.hanzi
text = 'I broke a plate: 我打破了一个盘子.'
cjk_characters = re.findall('[{}]'.format(zhon.hanzi.characters), text)
print(f"Original text: {text}")
print(f"Found CJK characters: {cjk_characters}")
# Example with Pinyin
import zhon.pinyin
pinyin_text = 'Yuànzi lǐ tíngzhe yí liàng chē.'
pinyin_words = re.findall(zhon.pinyin.word, pinyin_text, re.I)
print(f"Original Pinyin text: {pinyin_text}")
print(f"Found Pinyin words: {pinyin_words}")