Konoha: Japanese Tokenizer Wrapper

5.7.0 · active · verified Thu Apr 16

Konoha is a Python library (v5.7.0) that provides a unified, easy-to-use interface to various Japanese tokenizers, including MeCab, Sudachi, and SentencePiece. It lets developers switch seamlessly between tokenizers, and also ships rule-based tokenizers (whitespace, character) and a sentence splitter. The library is actively maintained, with its latest release in March 2026.
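The rule-based tokenizers mentioned above need no external analyzer. A minimal sketch of what they do (an illustration only, not konoha's actual implementation):

```python
def whitespace_tokenize(text: str) -> list[str]:
    # Split on runs of whitespace, like a 'Whitespace' tokenizer.
    return text.split()

def character_tokenize(text: str) -> list[str]:
    # Split into individual characters, like a 'Character' tokenizer.
    return list(text)

print(character_tokenize('自然言語'))    # ['自', '然', '言', '語']
print(whitespace_tokenize('I am fine'))  # ['I', 'am', 'fine']
```

Character-level splitting is a common fallback for Japanese, which has no whitespace word boundaries.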

Install
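Per the quickstart comment below, each tokenizer backend is installed as an extra; for example:

```shell
# Install konoha with the MeCab backend,
# or use 'konoha[all]' to pull in every supported tokenizer.
pip install 'konoha[mecab]'
```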

Imports

Quickstart

Demonstrates basic word-level tokenization of a Japanese sentence using the `WordTokenizer` with MeCab. Ensure the necessary tokenizer is installed as an extra.

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# Initialize with a supported tokenizer (e.g., MeCab)
# Ensure 'konoha[mecab]' or 'konoha[all]' is installed
tokenizer = WordTokenizer('MeCab')

tokens = tokenizer.tokenize(sentence)
print([token.surface for token in tokens])
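The sentence splitter mentioned in the overview handles the step before word tokenization. A naive sketch of the task in plain Python (konoha's own splitter is more robust, e.g. around quoted text); the function name is illustrative:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after Japanese terminal punctuation (。！？) using a
    # lookbehind so the punctuation stays attached to its sentence.
    parts = re.split(r'(?<=[。！？])', text)
    return [p for p in parts if p]

print(split_sentences('私は猫が好き。犬も好き。'))
# ['私は猫が好き。', '犬も好き。']
```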
