Sudachi Dictionary Core Edition

20260116 · active · verified Sat Apr 11

SudachiDict-core is the default dictionary for SudachiPy, the Python version of the Sudachi Japanese morphological analyzer. It supplies the basic vocabulary used for tokenization and linguistic analysis. The dictionary packages are updated frequently, often several times a quarter, adding new words and refining synonym definitions. The current version is 20260116.

Warnings

Install

Imports

Quickstart

This quickstart uses the sudachidict-core dictionary through SudachiPy to run Japanese morphological analysis. It builds a tokenizer from the core dictionary, tokenizes a sample Japanese compound noun, and prints the details of the first morpheme.

from sudachipy import Dictionary, SplitMode

# Load the Sudachi dictionary. The core edition is the default when it is installed;
# dict='core' selects it explicitly (dict_type is a deprecated alias for dict).
# The dictionary files come from the installed sudachidict-core package.
dict_obj = Dictionary(dict='core')
tokenizer = dict_obj.create()

text = "外国人参政権"

# Split mode A produces the shortest units; B is intermediate and C the longest.
mode = SplitMode.A

# The mode can be passed per call, as here, or fixed once when the tokenizer
# is created: dict_obj.create(mode=SplitMode.A).
# The import style above requires SudachiPy v0.6.0+ (the sudachi.rs-based release).
morphemes = tokenizer.tokenize(text, mode)

print(f"Original text: {text}")
print(f"Tokens (Mode A): {[m.surface() for m in morphemes]}")

# Example accessing morpheme details
if morphemes:
    first_morpheme = morphemes[0]
    print(f"\nFirst morpheme: {first_morpheme.surface()}")
    print(f"  Part-of-speech: {first_morpheme.part_of_speech()}")
    print(f"  Normalized form: {first_morpheme.normalized_form()}")
    print(f"  Dictionary form: {first_morpheme.dictionary_form()}")
    print(f"  Reading form: {first_morpheme.reading_form()}")
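
The quickstart tokenizes in mode A only. As a minimal sketch of how the three split modes could be compared on the same text, the helper below tokenizes once per mode and collects the surface strings. The names surfaces_by_mode and demo are hypothetical, introduced here for illustration, and the sketch assumes SudachiPy v0.6.0+ with sudachidict-core installed.

```python
from typing import Dict, List


def surfaces_by_mode(tokenizer, text: str, modes: Dict[str, object]) -> Dict[str, List[str]]:
    """Tokenize `text` once per split mode and collect the surface strings.

    `tokenizer` is anything with a tokenize(text, mode) method that returns
    objects exposing surface(), such as a SudachiPy tokenizer.
    """
    return {
        name: [m.surface() for m in tokenizer.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def demo() -> None:
    """Run against the real core dictionary (hypothetical helper, not called here).

    Requires the sudachipy and sudachidict_core packages to be installed.
    """
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary(dict="core").create()
    modes = {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C}
    for name, surfaces in surfaces_by_mode(tokenizer, "外国人参政権", modes).items():
        print(f"Mode {name}: {surfaces}")
```

Calling demo() prints one line per mode. The split points depend on the installed dictionary version, so no expected output is shown here.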
