Jieba Chinese Word Segmentation
Jieba is a popular open-source Python library for Chinese word segmentation. Its name (结巴, literally "stammerer") is a pun on the way the library "cuts" sentences into words. It offers several segmentation modes built on a prefix dictionary plus an HMM (Hidden Markov Model) for out-of-vocabulary words, as well as a newer deep-learning mode backed by PaddlePaddle. The most recent release is 0.42.1 (January 2020); development has been largely dormant since then, so check the project repository for current status.
Warnings
- gotcha The deep-learning Paddle mode (`use_paddle=True`) requires `paddlepaddle` to be installed separately and can be significantly more resource-intensive and slower than the default dictionary + HMM mode. The jieba README recommends `pip install paddlepaddle-tiny==1.6.1`; monitor memory use and throughput if you enable this mode.
- gotcha Prior to v0.42, passing an empty string to `jieba.cut` in Paddle mode (`use_paddle=True`) could lead to a coredump. While fixed in v0.42, users on older versions or those not updating should ensure input strings are not empty when using this mode.
- gotcha Full mode (`cut_all=True`) has historically had trouble segmenting mixed English and Chinese text correctly, and some older versions (prior to v0.42) could drop characters. Although many of these issues have been fixed, verify the output on mixed-script text, or prefer the default precise mode when accuracy matters.
- gotcha When working with custom `Tokenizer` instances, ensure that `tokenizer.add_word()` and `tokenizer.del_word()` are used on the instance itself. Prior to v0.40, `add_word` could incorrectly affect the global default `Tokenizer`. Also, custom dictionaries with hyphens (`-`) were buggy before v0.40.
Install
- pip install jieba
- pip install paddlepaddle-tiny==1.6.1   (optional, enables Paddle mode)
Imports
import jieba
Quickstart
import jieba
# Precise mode (default): prefix dictionary + HMM for unknown words
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print(f"Default Mode: {'/'.join(seg_list)}")
# Full segmentation mode (all possible words)
seg_list_all = jieba.cut(text, cut_all=True)
print(f"Full Mode: {'/'.join(seg_list_all)}")
# Search engine mode (short words for indexing)
seg_list_search = jieba.cut_for_search(text)
print(f"Search Engine Mode: {'/'.join(seg_list_search)}")
# Custom dictionary loading
# You would typically have a file named user.dict in the same directory
# with words and their frequencies/parts of speech, e.g.,
# 结巴 3
# 人工智能 5
# jieba.load_userdict('user.dict')