UniDic for Python
UniDic is a dictionary for the MeCab morphological analyzer, designed for modern written Japanese. The `unidic` Python package distributes this dictionary data so it can be used with MeCab wrappers such as `fugashi` or `mecab-python3`. The current version is 1.1.0; new releases generally track updated UniDic data or add minor quality-of-life improvements.
Warnings
- breaking Version 1.1.0 changed the default ("latest") UniDic version from 2.3.0 to 3.1.0. If your application relied on the older default dictionary, its analysis results may change.
- gotcha The `unidic` package itself is small, but after `pip install`, you MUST run `python -m unidic download` to fetch the actual dictionary data. This download can be large (around 770MB-1GB on disk). If this step is skipped, the package will not function.
- gotcha UniDic is a dictionary, not an analyzer. It requires a separate MeCab wrapper like `fugashi` or `mecab-python3` to perform morphological analysis. Without one of these, `unidic` alone cannot process text.
- gotcha The `unidic` dictionary is specific to the Japanese language. Do not install it unless your application actually analyzes Japanese text.
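Given the default-dictionary change noted above, one way to shield an application from it is to pin the package version at install time. A sketch of a requirements.txt entry, assuming you have verified which package version ships the dictionary data your application expects:

```
# requirements.txt
unidic<1.1   # keep the pre-3.1.0 default dictionary
```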
Install
- pip install unidic
- python -m unidic download
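Because the install is a two-step process, it is easy to ship an environment where the download step never ran. A minimal guard sketch (the helper name `unidic_data_present` is hypothetical; pass it `unidic.DICDIR` in a real application):

```python
import os

def unidic_data_present(dicdir: str) -> bool:
    """Return True if the dictionary directory exists and is non-empty.

    The `unidic` package itself installs without the data; the compiled
    dictionary files only appear after `python -m unidic download`.
    """
    return os.path.isdir(dicdir) and len(os.listdir(dicdir)) > 0
```

Checking this at startup lets you fail with a clear message instead of letting MeCab raise a cryptic initialization error.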
Imports
- unidic
import unidic
- DICDIR
import unidic
unidic.DICDIR
Quickstart
import os
import subprocess
import sys

import unidic
import fugashi  # a MeCab wrapper; mecab-python3 also works

# Ensure the dictionary data is downloaded (important step!)
# The package ships without the data; `python -m unidic download` fetches it.
if not os.path.isdir(unidic.DICDIR) or not os.listdir(unidic.DICDIR):
    subprocess.run([sys.executable, '-m', 'unidic', 'download'], check=True)

# Initialize a MeCab tagger pointing at the UniDic dictionary directory
tagger = fugashi.Tagger(f'-d "{unidic.DICDIR}"')

# Analyze a Japanese sentence
sentence = "今日の天気は晴れです"
print(f"Sentence: {sentence}")
print(f"Analysis:\n{tagger.parse(sentence)}")

# Access individual tokens (fugashi-specific API)
for word in tagger(sentence):
    print(f"Surface: {word.surface}, Lemma: {word.feature.lemma}, POS: {word.feature.pos1}")
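`tagger.parse()` returns MeCab's plain-text output: one `surface<TAB>features` line per token, terminated by `EOS`. If you are not using fugashi's node API, a small helper can split that text into tokens. A sketch assuming MeCab's default output format (the exact feature fields depend on the dictionary version):

```python
def parse_mecab_output(raw: str) -> list[tuple[str, list[str]]]:
    """Split MeCab's default plain-text output into (surface, features) pairs."""
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens
```

For example, `parse_mecab_output("今日\t名詞,普通名詞\nEOS\n")` yields `[("今日", ["名詞", "普通名詞"])]`.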