Sudachi Dictionary Core Edition
SudachiDict-core is the default dictionary for SudachiPy, the Python version of the Japanese morphological analyzer Sudachi. It provides the basic vocabulary used for tokenization and linguistic analysis. Dictionary packages are released frequently, often several times a quarter, to add new words and improve synonym definitions. The current version is 20260116.
Warnings
- breaking Dictionary releases are versioned by date (e.g., '20260116') rather than by semantic versioning, so the version number does not signal whether a release is backward compatible. Any update can change tokenization, part-of-speech tags, and normalization behavior, particularly through additions or modifications in 'synonyms.txt'. Pin an exact version where reproducible output matters.
- gotcha This package is a dictionary resource, not a Python library with classes or functions meant for import. Its role is to supply data to the 'SudachiPy' morphological analyzer. The `sudachidict_core` module exists mainly so that SudachiPy can locate the dictionary files; importing it in application code gives you no tokenization API.
- gotcha Sudachi offers three dictionary editions: 'small', 'core' (default), and 'full'. Each has a different scope of vocabulary. Using 'core' when 'full' is needed for specific proper nouns (or vice versa) will lead to suboptimal tokenization results.
- gotcha The dictionary files (e.g., `system.dic`) are not always shipped inside the archive that pip fetches. When `sudachidict-core` is installed from a source distribution, the build step downloads the dictionary from a remote server, so installation needs network access to that server. Releases published as prebuilt wheels appear to bundle the dictionary, so this mainly matters when pip falls back to a source build.
- deprecated For SudachiPy versions prior to v0.5.2, a separate `sudachipy link` command was used to select which installed dictionary edition SudachiPy loaded by default. This command is no longer available in `SudachiPy` v0.5.2 and later; choose the edition when constructing `Dictionary` instead.
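The warnings about date-based versions and the three editions can be checked at run time. The following is a minimal sketch that uses only the standard library. It assumes the distributions are published as `sudachidict-small`, `sudachidict-core`, and `sudachidict-full`; the helper names `installed_editions` and `release_date` are hypothetical and not part of SudachiPy.

```python
from datetime import date, datetime
from importlib import metadata
from typing import Dict, Optional

# Assumed distribution suffixes for the three dictionary editions.
EDITIONS = ("small", "core", "full")


def installed_editions() -> Dict[str, Optional[str]]:
    """Map each edition to its installed version string, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for edition in EDITIONS:
        try:
            found[edition] = metadata.version(f"sudachidict-{edition}")
        except metadata.PackageNotFoundError:
            found[edition] = None
    return found


def release_date(version: str) -> date:
    """Interpret the leading YYYYMMDD part of a dictionary version as a date."""
    # Anything after the first dot (e.g. a ".post1" suffix) is ignored.
    return datetime.strptime(version.split(".")[0], "%Y%m%d").date()


for edition, version in installed_editions().items():
    if version is None:
        print(f"{edition}: not installed")
    else:
        print(f"{edition}: {version} (released {release_date(version)})")
```

Logging this information next to tokenization results makes it easier to trace a change in output back to a dictionary update.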
Install
- pip install sudachipy sudachidict-core
Imports
- sudachidict-core
This is a data-only package. It provides dictionary resources for SudachiPy and is not directly imported into Python code.
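Since the package only carries data, the useful question is usually where that data lives, not what the module exports. Below is a hedged sketch that locates the dictionary file without calling anything from the package itself. The `resources/system.dic` layout is an assumption about how the package is organized, and `find_dictionary_file` is a hypothetical helper name.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def find_dictionary_file(package: str = "sudachidict_core") -> Optional[Path]:
    """Return the path to the packaged dictionary file, or None if not found."""
    spec = find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package is not installed, or is not a package directory
    for location in spec.submodule_search_locations:
        # Assumed layout: <package dir>/resources/system.dic
        candidate = Path(location) / "resources" / "system.dic"
        if candidate.is_file():
            return candidate
    return None


path = find_dictionary_file()
print(path if path else "sudachidict_core dictionary file not found")
```

A `None` result means either that the package is missing or that the assumed layout does not match the installed release.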
Quickstart
# This import path is the one used by SudachiPy v0.6.0 and later.
from sudachipy import Dictionary, SplitMode

# Load the core edition explicitly (it is also the default when installed).
# The dictionary files come from the installed sudachidict-core package.
# Newer SudachiPy releases also accept dict='core'; dict_type is the older spelling.
dict_obj = Dictionary(dict_type='core')
tokenizer = dict_obj.create()

text = "外国人参政権"

# Mode A splits into the shortest units; B is intermediate; C keeps the longest units.
mode = SplitMode.A

# The mode can be passed per call, as here, or fixed when the tokenizer is
# created with dict_obj.create(mode=SplitMode.A).
morphemes = tokenizer.tokenize(text, mode)

print(f"Original text: {text}")
print(f"Tokens (Mode A): {[m.surface() for m in morphemes]}")

# Inspect the details of the first morpheme, if there is one.
if morphemes:
    first_morpheme = morphemes[0]
    print(f"\nFirst morpheme: {first_morpheme.surface()}")
    print(f"  Part-of-speech: {first_morpheme.part_of_speech()}")
    print(f"  Normalized form: {first_morpheme.normalized_form()}")
    print(f"  Dictionary form: {first_morpheme.dictionary_form()}")
    print(f"  Reading form: {first_morpheme.reading_form()}")
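The quickstart uses a single split mode. To see how the choice of mode affects segmentation, the same text can be tokenized once per mode and the results compared side by side. This is a sketch under stated assumptions: `surfaces_by_mode` and `compare_split_modes` are hypothetical helper names, and `compare_split_modes` needs both `sudachipy` and `sudachidict-core` installed.

```python
from typing import Any, Dict, List, Mapping


def surfaces_by_mode(
    tokenizer: Any, text: str, modes: Mapping[str, Any]
) -> Dict[str, List[str]]:
    """Tokenize `text` once per mode and collect the surface strings."""
    return {
        name: [m.surface() for m in tokenizer.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def compare_split_modes(text: str) -> Dict[str, List[str]]:
    """Run the comparison with SudachiPy and the core dictionary."""
    # Imported lazily so that defining this helper does not require SudachiPy.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary(dict_type="core").create()
    modes = {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C}
    return surfaces_by_mode(tokenizer, text, modes)
```

Calling `compare_split_modes("外国人参政権")` returns one list of surfaces per mode. The exact segmentation depends on the installed dictionary version, so treat any recorded output as tied to that version.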