rjieba
rjieba is a Python binding for the `jieba-rs` Rust library that provides Chinese word segmentation. By running the segmentation in Rust, it aims to be faster than pure-Python implementations. The current version is 0.2.0. Releases are infrequent and are typically driven by significant updates to the underlying `jieba-rs` library or the `pyo3` binding infrastructure.
Common errors
- ModuleNotFoundError: No module named 'rjieba'
  cause: The `rjieba` package was not installed successfully or is not accessible in the current Python environment. This can happen if the installation failed, especially on platforms that have no pre-built wheel and therefore require compilation.
  fix: Run `pip install rjieba` again, in the same environment that runs your code. If it fails with compilation errors, ensure you have Rust and a C compiler installed, or check whether your platform is covered by the pre-built wheels.
- UnicodeEncodeError: 'gbk' codec can't encode character '\u201c' in position X: illegal multibyte sequence
  cause: This error, though often associated with `jieba`, can occur with any text processing library if the input text or file encoding is misidentified, particularly when dealing with Chinese characters on systems where the default encoding is not UTF-8 (e.g., some Windows environments).
  fix: Explicitly handle all text as UTF-8. When reading files, specify `encoding='utf-8'` (e.g., `open('file.txt', 'r', encoding='utf-8')`), and pass the same argument when opening files for writing. If dealing with external data, convert it to UTF-8 before passing it to `rjieba`.
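The UTF-8 advice above can be sketched with the standard library alone. The file name `sample.txt` and the helper `read_utf8` are illustrative assumptions, not part of `rjieba`; the string read this way is what would then be passed to `rjieba`.

```python
import tempfile
from pathlib import Path


def read_utf8(path: Path) -> str:
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Round-trip a Chinese sentence through a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    # Write with an explicit encoding too, so output never falls back to e.g. GBK.
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("我们中出了一个叛徒")
    text = read_utf8(sample_path)

# Data that arrives as bytes in another encoding can be decoded explicitly first.
gbk_bytes = "叛徒".encode("gbk")
converted = gbk_bytes.decode("gbk")
```

Passing `encoding` on every `open()` call keeps the behavior identical across machines, whatever the locale default happens to be.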
Warnings
- gotcha Unlike the original `jieba` Python library, `rjieba` does not typically require an explicit `initialize()` call. Dictionaries are embedded and loaded automatically upon first use, which simplifies setup but might be unexpected for users familiar with `jieba`'s initialization patterns.
- gotcha As a Rust binding, `rjieba` relies on pre-compiled wheels for easy installation across different Python versions, operating systems, and architectures. If a pre-built wheel is not available for your specific environment, `pip install rjieba` might fail, requiring a Rust toolchain to compile from source.
- breaking While no explicit breaking changes are documented for `rjieba` itself, significant updates to the underlying `jieba-rs` Rust library (e.g., from `0.7.x` to `0.8.x`) could introduce subtle behavioral changes or new features that might affect `rjieba`'s output or API in future versions.
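Because output may shift between releases, it can help to record which `rjieba` version produced a given segmentation. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of `rjieba`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "rjieba") -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


print(f"rjieba version: {installed_version()}")
```

Logging this value alongside segmentation results makes it easier to tell whether a change in output coincides with an upgrade.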
Install
- pip install rjieba
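To confirm that the install landed in the interpreter you are actually running, you can ask Python where it would import `rjieba` from. The sketch below uses only the standard library; the helper name `describe_install` is hypothetical and not part of `rjieba`.

```python
import importlib.util
import sys


def describe_install(module_name: str = "rjieba") -> str:
    """Report whether `module_name` is importable from this interpreter."""
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return f"{module_name} is NOT importable from {sys.executable}"
    return f"{module_name} found at {spec.origin} (interpreter: {sys.executable})"


print(describe_install())
```

If the module is reported as missing, running `python -m pip install rjieba` with that same interpreter installs it into the matching environment.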
Imports
- rjieba
import rjieba
Quickstart
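import rjieba

text = '我们中出了一个叛徒'

# Segment the sentence into words.
segmented_text = rjieba.cut(text)
print(f"Segmented (cut): {list(segmented_text)}")

# Segment the sentence and attach a part-of-speech tag to each word.
tagged_text = rjieba.tag(text)
print(f"Tagged: {list(tagged_text)}")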