natto-py (MeCab binding)
natto-py is a Python package that provides a Foreign Function Interface (FFI) binding to MeCab, the part-of-speech and morphological analyzer for the Japanese language. It allows Python applications to leverage MeCab's capabilities without requiring SWIG or a C compiler for installation. The current version is 1.0.1, and the library is actively maintained with an irregular release cadence.
Warnings
- breaking: Version 1.0.0 of natto-py dropped support for Python 2.x. Users who still need Python 2 compatibility must pin `natto-py==0.9.2`.
- gotcha: natto-py requires a pre-installed MeCab library (v0.996+) and a system dictionary (e.g., IPADIC, UniDic) on the operating system. Neither is bundled with the package. If natto-py cannot locate MeCab automatically, set the `MECAB_PATH` environment variable (path to the MeCab shared library) and `MECAB_CHARSET` (the dictionary's character encoding).
- gotcha: Instantiate `MeCab` in a Python `with` statement (e.g., `with MeCab() as nm:`). This ensures that MeCab's internal resources are released when the block exits, preventing potential memory leaks or crashes.
- gotcha: For detailed morphological analysis, parse with `as_nodes=True` and customize MeCab's output format by passing the `-F` (node-format) and `-U` (unknown-format) options when instantiating `MeCab`. This yields structured `MeCabNode` objects with rich features and avoids brittle manual parsing of the string output.
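The `-F` and `-U` option string mentioned above is easy to mistype by hand. The following is a minimal sketch of a helper that assembles it; `build_format_options` is a hypothetical name, not part of natto-py, and the choice to emit `%m` (the surface form) followed by `?` placeholders for unknown words is an assumption made for illustration.

```python
def build_format_options(num_features: int = 9) -> str:
    """Build a MeCab option string with -F (node-format) and -U (unk-format).

    Hypothetical helper for illustration; not part of natto-py.
    """
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    # %m is the surface form; %f[i] is the i-th dictionary feature field.
    known = ",".join(["%m"] + [f"%f[{i}]" for i in range(num_features)])
    # Unknown words: surface form plus one "?" placeholder per feature field.
    unknown = ",".join(["%m"] + ["?"] * num_features)
    # "\\n" is a literal backslash followed by "n"; MeCab expands it to a newline.
    return f"-F{known}\\n -U{unknown}\\n"


print(build_format_options(2))
```

With MeCab and a dictionary installed, the result could be passed straight to the constructor, for example `with MeCab(build_format_options()) as nm:`. The default of nine feature fields assumes an IPADIC-style dictionary; other dictionaries may expose a different number of fields.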
Install
pip install natto-py
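Because `pip install natto-py` does not install MeCab itself, it can help to check whether the MeCab shared library is discoverable before importing natto. The sketch below uses only the standard library; `locate_mecab_library` is a hypothetical helper, and its lookup order is an assumption that may differ from natto-py's own detection logic.

```python
import os
from ctypes.util import find_library


def locate_mecab_library():
    """Return a path or name for the MeCab shared library, or None.

    Hypothetical pre-flight check; natto-py's own lookup may differ.
    """
    # An explicit MECAB_PATH setting takes precedence.
    explicit = os.environ.get("MECAB_PATH")
    if explicit:
        return explicit
    # Otherwise ask the platform's standard library search for "mecab".
    return find_library("mecab")


found = locate_mecab_library()
if found is None:
    print("MeCab library not found; install MeCab or set MECAB_PATH.")
else:
    print(f"MeCab library candidate: {found}")
```

A `None` result only means the standard search did not find the library; setting `MECAB_PATH` to the library's full path is the fallback described in the warnings.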
Imports
- MeCab
from natto import MeCab
- MeCabNode
from natto.node import MeCabNode
Quickstart
import os
from natto import MeCab

# Optional: set the MeCab path and charset if auto-detection fails
# os.environ['MECAB_PATH'] = os.environ.get('MECAB_PATH', '/usr/local/lib/libmecab.so')
# os.environ['MECAB_CHARSET'] = os.environ.get('MECAB_CHARSET', 'utf8')

# Instantiate MeCab with recommended options for detailed parsing
# -F: node-format for features, -U: unk-format for unknown words
with MeCab(r'-F%m,%f[0],%f[1],%f[2],%f[3],%f[4],%f[5],%f[6],%f[7],%f[8]\n -U?,?,?,?,?,?,?,?,?,?\n') as nm:
    text = 'これは日本語のテキストです。'

    print(f"Parsed text (string output):\n{nm.parse(text)}\n")

    print("Parsed text (node output with features):")
    for n in nm.parse(text, as_nodes=True):
        if not n.is_eos():  # Ignore end-of-sentence nodes
            print(f'Surface: {n.surface}, Feature: {n.feature}, Cost: {n.cost}')
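If the nodes are needed after the `with` block has exited, one option is to copy the fields of interest into plain Python records while the parser is still open. The following is a minimal sketch under that assumption; `Token` and `collect_tokens` are hypothetical names, and the demo uses made-up stand-in objects (not real MeCab output) so it runs without MeCab installed.

```python
from types import SimpleNamespace
from typing import NamedTuple


class Token(NamedTuple):
    """Plain record holding the node fields used in the quickstart."""
    surface: str
    feature: str
    cost: int


def collect_tokens(nodes):
    """Copy surface, feature and cost from node-like objects, skipping EOS."""
    return [
        # Strip the trailing newline that a "\n" in the -F format would add.
        Token(n.surface, n.feature.rstrip("\n"), n.cost)
        for n in nodes
        if not n.is_eos()
    ]


# Stand-in nodes with invented values, for illustration only.
demo_nodes = [
    SimpleNamespace(surface="日本語", feature="名詞,一般\n", cost=100,
                    is_eos=lambda: False),
    SimpleNamespace(surface="", feature="BOS/EOS", cost=0,
                    is_eos=lambda: True),
]
tokens = collect_tokens(demo_nodes)
print(tokens)
```

With a real parser, the call would look like `collect_tokens(nm.parse(text, as_nodes=True))` inside the `with` block.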