UniDic-lite
unidic-lite is a compact distribution of UniDic, a Japanese morphological analysis dictionary, packaged for Python. Unlike the larger `unidic` package, it installs directly via pip with no additional download step. It bundles UniDic 2.1.2 (2013) and occupies approximately 250MB of disk space after installation. The current version is 1.0.8, released in January 2021; releases are infrequent because the package primarily serves as a static dictionary resource.
Warnings
- gotcha unidic-lite is solely a dictionary resource. To perform Japanese morphological analysis, a separate MeCab wrapper library like `fugashi` or `mecab-python3` must be installed and used in conjunction with unidic-lite.
- gotcha Despite 'lite' in its name, unidic-lite requires approximately 250MB of disk space for the dictionary data after installation.
- gotcha unidic-lite is based on UniDic 2.1.2 from 2013. This older version may lack vocabulary for modern terms and phrases compared to the full `unidic` package (which uses UniDic 3.1.0 and is much larger).
- gotcha The unidic-lite dictionary has minor modifications from the official UniDic release, including added entries for '令和', removal of single-character numeric and alphabetic words, and changes to `unk.def`. These might lead to slightly different tokenization results compared to an unmodified UniDic.
- gotcha Users frequently encounter 'Failed initializing MeCab' errors when using MeCab wrappers with unidic-lite. This often stems from the wrapper failing to locate the dictionary or underlying MeCab installation issues (e.g., missing C++ redistributables on Windows).
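For the initialization failure described in the last warning, checking the dictionary directory can rule out one possible cause. The sketch below assumes that a compiled MeCab dictionary directory contains `sys.dic`, `matrix.bin`, `char.bin`, `unk.dic`, and `dicrc`. The helper names `missing_dictionary_files` and `report` are hypothetical and not part of unidic-lite.

```python
import importlib.util
from pathlib import Path

# Files a compiled MeCab dictionary directory is assumed to contain.
EXPECTED_FILES = ("sys.dic", "matrix.bin", "char.bin", "unk.dic", "dicrc")


def missing_dictionary_files(dicdir):
    """Return the expected dictionary files that are absent from dicdir."""
    root = Path(dicdir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def report():
    """Return a short status string for the installed unidic-lite, if any."""
    if importlib.util.find_spec("unidic_lite") is None:
        return "unidic-lite is not installed in this environment"
    import unidic_lite

    missing = missing_dictionary_files(unidic_lite.DICDIR)
    if missing:
        return f"{unidic_lite.DICDIR} is missing: {', '.join(missing)}"
    return f"{unidic_lite.DICDIR} looks complete"


if __name__ == "__main__":
    print(report())
```

If the directory looks complete and the error persists, the cause is more likely in the MeCab wrapper or its native dependencies than in unidic-lite itself.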
Install
pip install unidic-lite
Imports
- DICDIR
import unidic_lite; print(unidic_lite.DICDIR)
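When a wrapper does not pick up the dictionary by itself, `DICDIR` can be turned into MeCab command-line arguments. A minimal sketch, assuming the dictionary directory may ship a `mecabrc` file (checked at run time rather than taken for granted). `mecab_args` is a hypothetical helper; `-r` (rcfile) and `-d` (dicdir) are standard MeCab options.

```python
import os


def mecab_args(dicdir):
    """Build a MeCab argument string that points at dicdir."""
    args = []
    mecabrc = os.path.join(dicdir, "mecabrc")
    if os.path.isfile(mecabrc):
        # Prefer an rc file found next to the dictionary over a system-wide one.
        args.append(f'-r "{mecabrc}"')
    # Quote the path so directories containing spaces survive argument parsing.
    args.append(f'-d "{dicdir}"')
    return " ".join(args)
```

The resulting string can be passed to a wrapper's tagger constructor, for example `Tagger(mecab_args(unidic_lite.DICDIR))` with fugashi.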
Quickstart
import unidic_lite
from fugashi import Tagger
# fugashi can locate unidic-lite on its own; passing -d explicitly pins the
# dictionary even if the full unidic package is also installed
tagger = Tagger(f'-d "{unidic_lite.DICDIR}"')
text = "すもももももももものうち"
# Analyze the text
words = []
for word in tagger(text):
    words.append(f'{word.surface}\t{word.feature.pos1}\t{word.feature.lemma}')
print('\n'.join(words))
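As a possible next step after the Quickstart, the tokens can be aggregated by lemma. This is a sketch under stated assumptions: each token exposes `surface` and `feature.lemma` as in the loop above, and `lemma` may be `None` or absent for words the dictionary does not know. `lemma_counts` is a hypothetical helper, not part of fugashi or unidic-lite.

```python
from collections import Counter


def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the surface form."""
    counts = Counter()
    for token in tokens:
        # Assumption: unknown words may have no usable lemma field.
        lemma = getattr(token.feature, "lemma", None)
        counts[lemma or token.surface] += 1
    return counts
```

With the Quickstart objects, `lemma_counts(tagger(text)).most_common(3)` would list the three most frequent lemmas in the text.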