Sudachi Dictionary (Full Edition)
sudachidict-full is a data package providing the largest ('full') edition of the Japanese dictionary used by SudachiPy, a Japanese morphological analyzer. It exposes no tokenization API of its own; it exists solely as a dictionary dependency for SudachiPy. The current version is 20260116, and new versions are released regularly (typically every 2-3 months) to update dictionary entries and synonyms.
Warnings
- gotcha sudachidict-full is a data package providing dictionary files, not a standalone library for performing morphological analysis. You must install `sudachipy` separately to utilize the dictionary data.
- breaking Starting from version 20251022, Sudachi's internal dictionary normalization has been partly discontinued and replaced with a synonym dictionary. This change may lead to different tokenization results or altered behavior for applications relying on the previous normalization process.
- gotcha When multiple Sudachi dictionaries (e.g., `sudachidict-small`, `sudachidict-core`, `sudachidict-full`) are installed, `sudachipy`'s default `dictionary.Dictionary().create()` method will automatically prioritize and load the largest available dictionary. To explicitly guarantee the 'full' dictionary is used, you can initialize with `dict_type='full'` (e.g., `dictionary.Dictionary(dict_type='full').create()`).
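The selection behavior described in the last warning can be sketched in plain Python. This is an illustrative sketch only: the helper name `pick_default_dict_type`, the `PRIORITY` list, and the exact search order are assumptions for illustration, not SudachiPy internals.

```python
import importlib.util

# Illustrative sketch (not SudachiPy's actual code): prefer the largest
# installed Sudachi dictionary edition, mirroring the default behavior
# described in the warning above.
PRIORITY = ["full", "core", "small"]  # largest edition first

def pick_default_dict_type():
    """Return the largest installed dictionary edition, or None if none found."""
    for edition in PRIORITY:
        if importlib.util.find_spec(f"sudachidict_{edition}") is not None:
            return edition
    return None

print(f"Default dictionary edition: {pick_default_dict_type()}")
```

Checking `importlib.util.find_spec` avoids actually importing (and loading) the dictionary packages, which can be large.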
Install
pip install sudachidict-full sudachipy
Imports
- Path
from sudachidict_full.dictionary import Path
Quickstart
from sudachipy import tokenizer
from sudachipy import dictionary
# sudachidict-full must be installed for this to load the full dictionary.
# SudachiPy automatically selects the largest installed dictionary by default.
# To explicitly ensure the 'full' dictionary is used, you can pass dict_type='full'.
# tokenizer_obj = dictionary.Dictionary(dict_type='full').create()
# Create a Sudachi tokenizer instance (will use the 'full' dict if installed)
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
text = "寿司は美味しい。"
# Tokenize the text
print(f"Original text: {text}")
morphemes = tokenizer_obj.tokenize(text, mode)
print("\nTokenization results (surface form, part of speech, dictionary form):")
for m in morphemes:
    # Note: SudachiPy's Morpheme has no base_form(); the lemma is dictionary_form().
    print(f"  {m.surface()}\t{m.part_of_speech()}\t{m.dictionary_form()}")
# Example of getting the dictionary path (for advanced configuration)
# import sudachidict_full
# dict_path = sudachidict_full.dictionary.Path()
# print(f"\nPath to the 'full' dictionary: {dict_path}")