Sudachi Dictionary Core Edition

20260116 · active · verified Sat Apr 11

SudachiDict-core is the default dictionary for SudachiPy, a Python-based Japanese morphological analyzer. It provides a comprehensive basic vocabulary for tokenization and linguistic analysis. The dictionary packages are updated frequently, often multiple times a quarter, incorporating new words and improving synonym definitions. The current version is 20260116.

Warnings

breaking Dictionary updates are versioned by date (e.g., '20260116'), not semantic versioning. Frequent updates can introduce changes to tokenization, part-of-speech tags, and normalization behavior, particularly due to additions/modifications in 'synonyms.txt'.
Fix: Pin the 'sudachidict-core' version in your project dependencies (e.g., `sudachidict-core==20260116`) to ensure consistent behavior. Regularly review release notes for significant changes if upgrading.
gotcha This package is a dictionary resource, not a Python library with classes or functions to call. Its role is to supply data to the 'SudachiPy' morphological analyzer. Even where `import sudachidict_core` succeeds, the module exposes no analysis API of its own, so importing it directly accomplishes nothing useful.
Fix: Install `sudachidict-core` via `pip`, then use `sudachipy.Dictionary(dict_type='core').create()` to load and utilize the dictionary through `SudachiPy`.
gotcha Sudachi offers three dictionary editions: 'small', 'core' (default), and 'full'. Each has a different scope of vocabulary. Using 'core' when 'full' is needed for specific proper nouns (or vice versa) will lead to suboptimal tokenization results.
Fix: Choose the appropriate dictionary edition (`sudachidict-small`, `sudachidict-core`, or `sudachidict-full`) based on your application's requirements. 'Core' is a good general-purpose choice, while 'full' includes more proper nouns. Install the specific dictionary package and ensure SudachiPy is configured to use it.
gotcha Depending on the release and on whether `pip` installs a prebuilt wheel or builds from a source distribution, the dictionary files (e.g., `system.dic`) may not be bundled in the `sudachidict-core` package and may instead be downloaded from a remote server during `pip install`. In that case, installation needs an active internet connection beyond access to the package index.
Fix: Ensure that the environment where `pip install sudachidict-core` is run has an internet connection. In restricted environments, you may need to pre-download the dictionary files or configure a local package mirror.
deprecated For SudachiPy versions prior to v0.5.2, a separate `sudachipy link` command was often required to make the dictionary available. This command is no longer available in newer `SudachiPy` versions (v0.5.2 and later).
Fix: For modern `SudachiPy` (v0.5.2+), simply installing `sudachidict-core` (or other editions) makes them discoverable by default. You can explicitly specify `dict_type='core'` when creating a `Dictionary` object if needed.
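
The pinning advice in the first warning can also be checked at runtime. The following is a minimal sketch that uses only the standard library's `importlib.metadata`. The helper names `installed_dict_version` and `matches_pin` and the constant `PINNED_VERSION` are hypothetical, introduced here for illustration.

```python
from importlib import metadata
from typing import Optional

# The date-based version this project was tested against (from the pin above).
PINNED_VERSION = "20260116"


def installed_dict_version(dist_name: str = "sudachidict-core") -> Optional[str]:
    """Return the installed version of the dictionary package, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str = PINNED_VERSION) -> bool:
    """True only when a dictionary is installed and its version equals the pin."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_dict_version()
    if matches_pin(found):
        print(f"sudachidict-core {found} matches the pin")
    else:
        print(f"expected {PINNED_VERSION}, found {found!r}")
```

A check like this could run at application startup or in CI so that an unplanned dictionary upgrade is noticed before it changes tokenization results.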

Install

Install SudachiPy and Core Dictionary:
pip install sudachipy sudachidict-core

Imports

sudachidict-core
This is a data-only package. It installs dictionary files that SudachiPy discovers automatically or can be configured to use explicitly. You do not import symbols from 'sudachidict-core' itself.
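
To see which editions are available for discovery without importing anything from them, one option is to probe for the installed packages. This is a sketch under an assumption: that each edition installs a top-level module named `sudachidict_small`, `sudachidict_core`, or `sudachidict_full`. The helpers `installed_editions` and `pick_edition` are hypothetical names, not part of SudachiPy.

```python
from importlib.util import find_spec
from typing import Iterable, List, Optional

EDITIONS = ("small", "core", "full")


def installed_editions(editions: Iterable[str] = EDITIONS) -> List[str]:
    """List editions whose module (assumed name: sudachidict_<edition>) can be found."""
    return [e for e in editions if find_spec(f"sudachidict_{e}") is not None]


def pick_edition(preferred: Iterable[str], available: Iterable[str]) -> Optional[str]:
    """Return the first preferred edition that is available, or None if none is."""
    available_set = set(available)
    for edition in preferred:
        if edition in available_set:
            return edition
    return None


if __name__ == "__main__":
    # Prefer the largest vocabulary, falling back to smaller editions.
    choice = pick_edition(["full", "core", "small"], installed_editions())
    print(f"Edition to load: {choice}")
```

The chosen name can then be passed to `Dictionary(dict_type=...)`, as in the quickstart.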

Quickstart

This quickstart demonstrates how to use the 'sudachidict-core' dictionary through the 'SudachiPy' library to perform Japanese morphological analysis. It initializes the tokenizer with the core dictionary and then tokenizes an example Japanese sentence.

from sudachipy import Dictionary, SplitMode

# Initialize the Sudachi dictionary (core edition is used by default if installed)
# dict_type='core' explicitly ensures the core dictionary is loaded.
# The dictionary files are loaded from the installed sudachidict-core package.
dict_obj = Dictionary(dict_type='core')
tokenizer = dict_obj.create()

text = "外国人参政権"

# Tokenize in mode A, which splits the text into the shortest units
mode = SplitMode.A
# Passing the mode to tokenize() follows the example in SudachiPy's README.
# In SudachiPy v0.6.0+ (sudachi.rs-based), the mode can instead be set once
# at tokenizer creation: tokenizer = dict_obj.create(mode=SplitMode.A)
morphemes = tokenizer.tokenize(text, mode)

print(f"Original text: {text}")
print(f"Tokens (Mode A): {[m.surface() for m in morphemes]}")

# Example accessing morpheme details
if morphemes:
    first_morpheme = morphemes[0]
    print(f"\nFirst morpheme: {first_morpheme.surface()}")
    print(f"  Part-of-speech: {first_morpheme.part_of_speech()}")
    print(f"  Normalized form: {first_morpheme.normalized_form()}")
    print(f"  Dictionary form: {first_morpheme.dictionary_form()}")
    print(f"  Reading form: {first_morpheme.reading_form()}")
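
The quickstart uses only mode A. To compare how the split modes segment the same text, a small helper along the following lines may be useful. The function name `surfaces_by_mode` is hypothetical; it relies only on the `tokenize(text, mode)` and `surface()` calls already used above, and the demo is skipped when the packages are not installed.

```python
from importlib.util import find_spec
from typing import Dict, List


def surfaces_by_mode(tokenizer, text: str, modes: Dict[str, object]) -> Dict[str, List[str]]:
    """Tokenize `text` once per mode and collect the surface strings for each."""
    return {
        name: [m.surface() for m in tokenizer.tokenize(text, mode)]
        for name, mode in modes.items()
    }


if __name__ == "__main__":
    # Assumes the dictionary's module is named sudachidict_core.
    if find_spec("sudachipy") is None or find_spec("sudachidict_core") is None:
        print("Install sudachipy and sudachidict-core to run this demo")
    else:
        from sudachipy import Dictionary, SplitMode

        tokenizer = Dictionary(dict_type="core").create()
        modes = {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C}
        for name, tokens in surfaces_by_mode(tokenizer, "外国人参政権", modes).items():
            print(name, tokens)
```

Mode A yields the shortest units and mode C the longest, with B in between, so printing the three side by side makes the effect of the mode choice visible for a given dictionary version.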
