Fugashi: Fast Pythonic Japanese Tokenization
Fugashi is a Cython wrapper for MeCab, providing fast and Pythonic Japanese tokenization and morphological analysis. It offers pre-built wheels for common platforms and simplifies dictionary installation, primarily recommending UniDic. The library is actively maintained and currently at version 1.5.2.
Warnings
- breaking Support for Python 3.6 and earlier versions was dropped in fugashi v1.2.0. Users on older Python versions must upgrade or use fugashi v1.1.2 or earlier.
- gotcha Fugashi requires a MeCab dictionary to function. Forgetting to install one (e.g., `unidic-lite` or `unidic`) is a common error and will lead to initialization failures.
- gotcha On platforms where pre-built wheels are not available (e.g., musl-based Linux distros like Alpine, PowerPC, or Windows 32-bit), MeCab itself must be installed from source *before* installing fugashi.
- deprecated Earlier MeCab wrappers (like `mecab-python3`) exposed `parseToNode` as a C-style linked list that had to be walked node by node. Fugashi instead returns a plain Python list of nodes from `tagger(text)` (or `parseToNodeList`), a significant usability improvement.
Install
- pip install fugashi
- pip install 'fugashi[unidic-lite]'
- pip install 'fugashi[unidic]' && python -m unidic download
Imports
- Tagger
from fugashi import Tagger
- GenericTagger
from fugashi import GenericTagger
- create_feature_wrapper
from fugashi import create_feature_wrapper
Quickstart
from fugashi import Tagger
# Initialize Tagger with '-Owakati' for whitespace-separated output
tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
# Get whitespace-separated tokens
wakati_output = tagger.parse(text)
print(f"Wakati: {wakati_output}")
# Iterate through words to get detailed features (UniDic assumed by default)
print("\nDetailed analysis:")
for word in tagger(text):
    print(f"Surface: {word.surface}\tLemma: {word.feature.lemma}\tPOS: {word.pos}")
# Example with GenericTagger and custom features (if not using UniDic or need specific fields)
# from fugashi import GenericTagger, create_feature_wrapper
# CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
# custom_tagger = GenericTagger(wrapper=CustomFeatures)
# print("\nCustom Tagger example:")
# for word in custom_tagger.parseToNodeList("テスト"):  # example, requires a configured custom dictionary
#     print(f"Surface: {word.surface}\tAlpha: {word.feature.alpha}")