Konoha: Japanese Tokenizer Wrapper
Konoha is a Python library that provides a unified, easy-to-use interface to a variety of Japanese tokenizers, including MeCab, Sudachi, and Sentencepiece. It lets developers switch between tokenizers with minimal code changes and also ships rule-based tokenizers (whitespace, character) and a sentence splitter. The library is actively maintained.
Common errors
- `ModuleNotFoundError: No module named 'MeCab'`
  - Cause: the MeCab tokenizer dependency (`mecab-python3`) was not installed along with konoha.
  - Fix: install konoha with the mecab extra: `pip install 'konoha[mecab]'` or `pip install 'konoha[all]'`.
- `RuntimeError: 'sudachipy' is not installed. Please install it with 'pip install sudachipy'`
  - Cause: the Sudachi tokenizer dependencies (`sudachipy` and `sudachidict_core`) were not installed.
  - Fix: install konoha with the sudachi extra: `pip install 'konoha[sudachi]'` or `pip install 'konoha[all]'`.
- `TypeError: __init__() missing 1 required positional argument: 'model_path'`
  - Cause: `WordTokenizer('Sentencepiece')` was initialized without the required `model_path` argument.
  - Fix: provide the path to your Sentencepiece model file: `WordTokenizer('Sentencepiece', model_path="path/to/your/model.spm")`.
Warnings
- gotcha Installing `konoha` without specifying extras (e.g., `pip install konoha`) will only install the sentence splitter, not any word tokenizers. To use tokenizers like MeCab or Sudachi, you must install `konoha` with the corresponding extra (e.g., `konoha[mecab]`) or `konoha[all]` for all supported tokenizers.
- breaking The API endpoint paths for the Docker quickstart (e.g., `/api/v1/tokenize`) changed in v4.6.4. Older `curl` commands or Docker configurations might fail.
- gotcha When using the `Sentencepiece` tokenizer, you must provide a valid `model_path` argument to `WordTokenizer`. Omitting it will result in an error or unexpected behavior.
Install
- `pip install 'konoha[all]'`
- `pip install 'konoha[mecab]'`
- `pip install konoha`
Imports
- WordTokenizer: `from konoha import WordTokenizer`
- SentenceTokenizer: `from konoha import SentenceTokenizer`
Quickstart
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# Initialize with a supported tokenizer (e.g., MeCab)
# Ensure 'konoha[mecab]' or 'konoha[all]' is installed
tokenizer = WordTokenizer('MeCab')
tokens = tokenizer.tokenize(sentence)
print([token.surface for token in tokens])
```