TinySegmenter

0.4 · maintenance · verified Mon Apr 13

This package is a Python port of TinySegmenter, an extremely compact (23 KB) JavaScript tokenizer for Japanese. It performs character-based segmentation with roughly 95% precision on Japanese news articles, produces segmentation units compatible with MeCab + IPADic, and requires no external dictionaries. The latest release, 0.4, dates from September 16, 2018; the project is in maintenance mode and not actively developed, though contributions are welcome.

Warnings

Install
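Assuming the package is published on PyPI under the name `tinysegmenter` (matching the module imported in the quickstart below), installation is a single pip command:

```shell
pip install tinysegmenter
```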

Imports

Quickstart

Initialize the TinySegmenter and tokenize a Japanese string into a list of words.

import tinysegmenter

segmenter = tinysegmenter.TinySegmenter()
text = "私の名前は中野です"
tokens = segmenter.tokenize(text)
print(' | '.join(tokens))
# Expected output: 私 | の | 名前 | は | 中野 | です
