TinySegmenter
TinySegmenter in Python is a port of the original JavaScript TinySegmenter, an extremely compact (23KB) Japanese tokenizer. It performs character-based segmentation with roughly 95% precision on Japanese news articles, producing units compatible with MeCab + IPADic, and requires no external dictionaries. The latest version, 0.4, was released on September 16, 2018; the project is no longer actively maintained, though contributions are welcome.
Warnings
- gotcha The maintainer explicitly states that the project is not actively developed and receives only limited maintenance. New features or rapid bug fixes are unlikely.
- gotcha As a 'very compact' tokenizer, TinySegmenter makes trade-offs in accuracy and performance compared to larger, more sophisticated Japanese NLP libraries. While suitable for lightweight tasks, it might not offer the highest precision or speed for complex or large-scale Japanese text processing.
- gotcha Although the `tinysegmenter` 0.4 package states compatibility with Python 3, a prominent fork named `tinysegmenter3` exists specifically to provide improved Python 3 compatibility and performance. This suggests the original `tinysegmenter` may be less optimized for modern Python 3 environments than its dedicated fork.
Install
pip install tinysegmenter
Imports
- TinySegmenter
from tinysegmenter import TinySegmenter
Quickstart
import tinysegmenter
segmenter = tinysegmenter.TinySegmenter()
text = "私の名前は中野です"
tokens = segmenter.tokenize(text)
print(' | '.join(tokens))
# Expected output: 私 | の | 名前 | は | 中野 | です
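Because `tokenize()` returns a plain list of strings, downstream processing is ordinary Python. A minimal sketch of counting token frequencies from the tokenizer's output (the `token_frequencies` helper is illustrative, not part of the library):

```python
from collections import Counter

def token_frequencies(tokens):
    """Count occurrences of each token, skipping whitespace-only tokens.

    TinySegmenter may emit whitespace as tokens for text containing
    spaces, so they are filtered out here.
    """
    return Counter(t for t in tokens if t.strip())

# Usage with TinySegmenter (assumes `pip install tinysegmenter`):
# import tinysegmenter
# segmenter = tinysegmenter.TinySegmenter()
# freqs = token_frequencies(segmenter.tokenize("私の名前は中野です"))
# freqs.most_common(3) then gives the three most frequent tokens.
```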