UniDic for Python

1.1.0 · active · verified Wed Apr 15

UniDic is a dictionary for the MeCab morphological analyzer, designed for modern written Japanese. The `unidic` Python package provides this dictionary data so that it can be used with MeCab wrappers such as `fugashi` or `mecab-python3`. The current version is 1.1.0; new releases generally incorporate updated UniDic data or minor quality-of-life improvements.

Warnings

Install

Imports

Quickstart

After installing `unidic` and a MeCab wrapper like `fugashi` (or `mecab-python3`), you must first download the actual dictionary data using `python -m unidic download`. Then, you can import `unidic` and pass `unidic.DICDIR` to your MeCab tagger to enable Japanese morphological analysis.

import subprocess
import sys

import fugashi  # mecab-python3 users would `import MeCab` instead
import unidic

# Ensure the dictionary is downloaded (important step!)
# sys.executable runs the download with the same interpreter that has `unidic` installed.
try:
    subprocess.run(
        [sys.executable, '-m', 'unidic', 'download'],
        check=True, capture_output=True, text=True,
    )
    print("UniDic dictionary downloaded successfully.")
except subprocess.CalledProcessError as e:
    # text=True makes e.stderr a str, so no decode() is needed
    if 'already exists' in e.stderr:
        print("UniDic dictionary already exists.")
    else:
        # Without the dictionary data the tagger below cannot be created
        raise SystemExit(f"Error downloading UniDic dictionary: {e.stderr}")

# Initialize MeCab Tagger with UniDic
tagger = fugashi.Tagger(f'-d "{unidic.DICDIR}"')

# Analyze a Japanese sentence
sentence = "今日の天気は晴れです"
result = tagger.parse(sentence)
print(f"Sentence: {sentence}")
print(f"Analysis: \n{result}")

# Access individual tokens (fugashi-specific API)
words = tagger.parseToNodeList(sentence)
for word in words:
    # UniDic fields such as lemma and pos1 are attributes of word.feature, not of the node itself
    print(f"Surface: {word.surface}, Lemma: {word.feature.lemma}, POS: {word.feature.pos1}")
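
Because the download step is easy to miss, it can help to check for the dictionary data before creating a tagger instead of re-running the download every time. The following is a minimal sketch, not part of the `unidic` API: the helper name `dictionary_present` is hypothetical, and it assumes that a compiled MeCab dictionary directory contains a `sys.dic` file.

```python
from pathlib import Path


def dictionary_present(dicdir: str) -> bool:
    """Return True if `dicdir` looks like a compiled MeCab dictionary.

    Assumption: a compiled MeCab system dictionary includes a `sys.dic`
    file, so its presence is used as a cheap proxy for "data downloaded".
    """
    return (Path(dicdir) / "sys.dic").is_file()
```

With `unidic` installed, one would call `dictionary_present(unidic.DICDIR)` and run `python -m unidic download` only when it returns `False`. The check looks at a single file, so it cannot detect a partial or corrupted download.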
