Jieba Chinese Word Segmentation

0.42.1 · active · verified Sat Apr 11

Jieba (结巴, Chinese for "to stutter") is a popular open-source Python library for Chinese word segmentation. It provides several segmentation modes: a dictionary-based accurate mode that uses an HMM (Hidden Markov Model) step for words not in the dictionary, a full mode that emits every possible word, a search-engine mode that further splits long words for indexing, and a newer deep-learning mode backed by PaddlePaddle. The current version is 0.42.1, and the project remains actively maintained.

Warnings

Install
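Jieba is published on PyPI, so a typical install is a single pip command. (The paddlepaddle line is an assumption: the optional deep-learning mode requires PaddlePaddle as a separate install.)

```shell
pip install jieba
# Optional, only needed for the PaddlePaddle-based mode (assumption:
# install the paddlepaddle package separately):
# pip install paddlepaddle
```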

Imports

Quickstart

This quickstart demonstrates the basic usage of Jieba for Chinese word segmentation, showing the default, full, and search engine modes. It also includes a comment on how to load a custom user dictionary.

import jieba

# Default (accurate) mode: dictionary lookup plus HMM for unknown words
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)  # returns a generator of tokens
print(f"Default Mode: {'/'.join(seg_list)}")

# Full segmentation mode (all possible words)
seg_list_all = jieba.cut(text, cut_all=True)
print(f"Full Mode: {'/'.join(seg_list_all)}")

# Search engine mode (re-segments long words into subwords for index recall)
seg_list_search = jieba.cut_for_search(text)
print(f"Search Engine Mode: {'/'.join(seg_list_search)}")

# Custom dictionary loading
# You would typically have a UTF-8 file named user.dict in the same
# directory, one entry per line: the word, then an optional frequency
# and part-of-speech tag, e.g.,
# 结巴 3
# 人工智能 5
# jieba.load_userdict('user.dict')
