Hanzi Identifier
Hanzi Identifier is a Python module that identifies whether Chinese text is written in Simplified or Traditional characters. It uses CC-CEDICT data to classify characters. The current stable version is 1.3.0. Releases are irregular, with major and minor updates arriving every few years.
Common errors
-
ModuleNotFoundError: No module named 'hanzidentifier'
cause The `hanzidentifier` library has not been installed in your current Python environment.
fix Run `pip install hanzidentifier` in your terminal to install the package.
-
My code expects `hanzidentifier.is_simplified()` to return `True` for a simplified text, but it returns `False` even though the text looks simplified.
cause `is_simplified()` returns `True` only when every Chinese character in the string is either Simplified or common to both systems, that is, when `identify()` returns `hanzidentifier.SIMPLIFIED` or `hanzidentifier.BOTH`. A single exclusively Traditional character makes `identify()` return `hanzidentifier.MIXED`, and a string with no Chinese characters yields `hanzidentifier.UNKNOWN`; in both cases `is_simplified()` returns `False`.
fix Check the output of `hanzidentifier.identify()` first. If it returns `hanzidentifier.MIXED`, look for Traditional-only characters in the text. If it returns `hanzidentifier.BOTH`, the characters are valid in both Simplified and Traditional contexts and both `is_simplified()` and `is_traditional()` return `True`; decide how your code should treat such cases.
-
`hanzidentifier.identify()` returns `hanzidentifier.UNKNOWN` for a string that was expected to contain Chinese characters.
cause The `identify()` function returns `UNKNOWN` when it finds no Chinese characters to classify. Non-Chinese text is ignored rather than counted against the result, so a string such as 'Some English text with 你好' is classified from 你好 alone. `UNKNOWN` therefore means that no Chinese characters were recognized in the input.
fix Use `hanzidentifier.has_chinese()` to confirm the presence of any Chinese characters. `identify()` categorizes the *type* of the Chinese characters it finds, not merely their existence. If `has_chinese()` returns `False`, inspect the input (for example with `repr(text)`) to check that the Chinese characters are really present.
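The checks described above can be combined into one small diagnostic helper. This is a minimal sketch, not part of the library: `load_hanzidentifier` and `diagnose` are hypothetical names, and the code assumes the module exposes the constants `SIMPLIFIED`, `TRADITIONAL`, `BOTH`, and `MIXED` under those names.

```python
import importlib


def load_hanzidentifier():
    """Return the hanzidentifier module, or None when it is not installed."""
    try:
        return importlib.import_module("hanzidentifier")
    except ModuleNotFoundError:
        return None


def diagnose(text):
    """Describe how hanzidentifier sees `text` (hypothetical helper)."""
    hz = load_hanzidentifier()
    if hz is None:
        return "hanzidentifier is not installed; run: pip install hanzidentifier"
    if not hz.has_chinese(text):
        return "no Chinese characters found"
    # Assumption: these constants exist on the module under these names.
    labels = {
        hz.SIMPLIFIED: "SIMPLIFIED",
        hz.TRADITIONAL: "TRADITIONAL",
        hz.BOTH: "BOTH",
        hz.MIXED: "MIXED",
    }
    return labels.get(hz.identify(text), "UNKNOWN")


for sample in ("Hello World", "Some English text with 你好"):
    print(repr(sample), "->", diagnose(sample))
```

Calling `has_chinese()` before `identify()` separates "nothing to classify" from the four classification results, which keeps the `UNKNOWN` case from being mistaken for an ambiguous one.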
Warnings
- breaking Version 1.0 (released 2014-04-12) introduced breaking changes, including renaming some constants. Code written for versions prior to 1.0 will likely fail.
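- gotcha The `identify()` function returns `hanzidentifier.BOTH` for strings whose Chinese characters are all valid in both the Simplified and Traditional character sets. For such strings `is_simplified()` and `is_traditional()` both return `True`, because each accepts text that is compatible with its character set rather than exclusive to it. To test for exclusively Simplified or exclusively Traditional text, compare the result of `identify()` with `hanzidentifier.SIMPLIFIED` or `hanzidentifier.TRADITIONAL` directly.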
- gotcha hanzidentifier is designed to identify Chinese characters. While many Japanese Kanji and Korean Hanja share ideographs with Chinese, this library does not distinguish between these languages. It will identify shared characters as Chinese.
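The `BOTH` gotcha is easier to see with a small stand-in model. The sketch below is not the library's code: the three character sets are tiny, hand-picked assumptions standing in for the CC-CEDICT-derived data, and `toy_identify`, `toy_is_simplified`, and `is_strictly_simplified` are hypothetical names. It assumes that a compatibility check accepts both `SIMPLIFIED` and `BOTH`, while a strict check accepts `SIMPLIFIED` only.

```python
# Toy stand-ins for the real character data (assumption: far from complete).
SHARED = set("你好家和件")
SIMPLIFIED_ONLY = set("国软")
TRADITIONAL_ONLY = set("國軟體")
KNOWN = SHARED | SIMPLIFIED_ONLY | TRADITIONAL_ONLY

# String labels used only by this sketch; the real constants live on the module.
UNKNOWN, TRADITIONAL, SIMPLIFIED, BOTH, MIXED = (
    "UNKNOWN", "TRADITIONAL", "SIMPLIFIED", "BOTH", "MIXED",
)


def toy_identify(text):
    """Classify the known Chinese characters in `text`, ignoring everything else."""
    chars = {c for c in text if c in KNOWN}
    if not chars:
        return UNKNOWN
    if chars <= SHARED:
        return BOTH
    if chars <= SHARED | TRADITIONAL_ONLY:
        return TRADITIONAL
    if chars <= SHARED | SIMPLIFIED_ONLY:
        return SIMPLIFIED
    return MIXED


def toy_is_simplified(text):
    """Compatibility check: shared-only text also counts as Simplified."""
    return toy_identify(text) in (SIMPLIFIED, BOTH)


def is_strictly_simplified(text):
    """Strict check: at least one Simplified-only character, no Traditional-only ones."""
    return toy_identify(text) == SIMPLIFIED


for sample in ("你好!", "软件", "軟體", "国家和國家", "Hello World"):
    print(sample, toy_identify(sample), toy_is_simplified(sample),
          is_strictly_simplified(sample))
```

With the real library, the strict variant would compare `hanzidentifier.identify(text)` against `hanzidentifier.SIMPLIFIED` instead of calling `is_simplified()`.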
Install
-
pip install hanzidentifier
Imports
- hanzidentifier
import hanzidentifier
- identify
from hanzidentifier import identify
- is_simplified
from hanzidentifier import is_simplified
- is_traditional
from hanzidentifier import is_traditional
- has_chinese
from hanzidentifier import has_chinese
Quickstart
import hanzidentifier
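# Basic identification: identify() returns one of the module constants
# (UNKNOWN, TRADITIONAL, SIMPLIFIED, BOTH, MIXED), so the constant's value is printed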
print(f"'你好!' identifies as: {hanzidentifier.identify('你好!')}")
print(f"'你好!' is Simplified: {hanzidentifier.is_simplified('你好!')}")
print(f"'你好!' is Traditional: {hanzidentifier.is_traditional('你好!')}")
# Example with strictly Simplified Chinese
print(f"'软件' identifies as: {hanzidentifier.identify('软件')}")
print(f"'软件' is Simplified: {hanzidentifier.is_simplified('软件')}")
# Example with strictly Traditional Chinese
print(f"'軟體' identifies as: {hanzidentifier.identify('軟體')}")
print(f"'軟體' is Traditional: {hanzidentifier.is_traditional('軟體')}")
# Example with mixed characters
print(f"'国家和國家' identifies as: {hanzidentifier.identify('国家和國家')}")
# Example with no Chinese characters
print(f"'Hello World' has Chinese: {hanzidentifier.has_chinese('Hello World')}")
print(f"'Hello World' identifies as: {hanzidentifier.identify('Hello World')}")