Fugashi: Fast Pythonic Japanese Tokenization

1.5.2 · active · verified Mon Apr 13

Fugashi is a Cython wrapper for MeCab, providing fast and Pythonic Japanese tokenization and morphological analysis. It offers pre-built wheels for common platforms and simplifies dictionary installation, primarily recommending UniDic. The library is actively maintained and currently at version 1.5.2.

Warnings

Install
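The Install section above is empty in this card. Per the fugashi README, installation is done with pip; the extras shown here (`unidic-lite` and `unidic`) are the dictionary options the project recommends, with `unidic-lite` being the simplest zero-configuration choice:

```shell
# Simplest setup: fugashi plus a small bundled UniDic dictionary
pip install 'fugashi[unidic-lite]'

# Alternative: the full UniDic dictionary (larger download, richer data)
pip install 'fugashi[unidic]'
python -m unidic download
```

Only one of the two dictionary options is needed; fugashi will pick up whichever is installed.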

Imports

Quickstart

Initializes a `Tagger` for Japanese text, demonstrating wakati (space-separated) tokenization and access to UniDic morphological features such as the lemma and part of speech. Calling `tagger(text)` directly yields `Word` objects, allowing convenient iteration and attribute access.

from fugashi import Tagger

# Initialize Tagger with '-Owakati' so parse() returns space-separated tokens
tagger = Tagger('-Owakati')

text = "麩菓子は、麩を主材料とした日本の菓子。"

# Get space-separated tokens
wakati_output = tagger.parse(text)
print(f"Wakati: {wakati_output}")

# Iterate through words to get detailed features
# (requires a UniDic dictionary such as unidic-lite to be installed)
print("\nDetailed analysis:")
for word in tagger(text):
    print(f"Surface: {word.surface}\tLemma: {word.feature.lemma}\tPOS: {word.pos}")

# Example with GenericTagger and a custom feature wrapper
# (for non-UniDic dictionaries whose feature fields differ)
# from fugashi import GenericTagger, create_feature_wrapper
# CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
# custom_tagger = GenericTagger(wrapper=CustomFeatures)
# print("\nCustom Tagger example:")
# for word in custom_tagger.parseToNodeList("テスト"):  # requires a dictionary whose features match the wrapper
#     print(f"Surface: {word.surface}\tAlpha: {word.feature.alpha}")
