SudachiDict (Small Edition) for SudachiPy

Version 20260428 · verified Fri May 01 · auth: no · python

The small edition of SudachiDict, the dictionary used by SudachiPy, the Japanese morphological analyzer. This package provides the dictionary data that sudachipy needs to perform tokenization and part-of-speech tagging. The current version is 20260428, and the dictionary is updated regularly. It is the smallest of the three editions (small, core, full), trading vocabulary coverage for a smaller footprint. SudachiPy loads the core edition by default, so the small edition has to be selected explicitly.

pip install sudachidict-small
error ImportError: cannot import name 'Small' from 'sudachidict_small'
cause The code tries to import a class from the dictionary package, which ships dictionary data and exposes no tokenizer classes.
fix Do not import from sudachidict_small. Import from sudachipy and select the dictionary by name, for example dictionary.Dictionary(dict='small').create().
error sudachipy.errors.MultipleDictionaryError: multiple dictionaries found for 'small'
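cause Reported when sudachidict-small and sudachidict-core are installed at the same time and the dictionary to load is ambiguous.
fix Select the edition explicitly, for example dictionary.Dictionary(dict='small'). SudachiPy normally allows several editions to be installed side by side, so uninstalling one (pip uninstall sudachidict-core) should be a last resort.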
breaking The dict_type argument is deprecated in recent SudachiPy releases (0.6 and later) in favor of dict. The edition names 'small', 'core', and 'full' are still the values to pass. Old code that sets dict_type='small' may warn now and break in a later release.
fix Use the new style: dictionary.Dictionary(dict='small').create(). If dict is omitted, SudachiPy loads the core edition (sudachidict-core), not small.
gotcha Do not import sudachidict_small directly. The package only carries dictionary data, so importing it gives you nothing to tokenize with and may cause confusion.
fix Always access the dictionary through sudachipy, which locates the installed package from the dictionary name you pass (for example dict='small').
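The notes above boil down to "check what is installed, then let sudachipy load it". A minimal sketch of that check follows. The helper names dictionary_installed and make_tokenizer are hypothetical, and the code assumes SudachiPy 0.6 or later, where dictionary.Dictionary accepts a dict argument.

```python
import importlib.util


def dictionary_installed(edition: str) -> bool:
    """Return True if the SudachiDict package for `edition` can be found.

    Only looks the module up; it does not import any symbols from it.
    """
    return importlib.util.find_spec(f"sudachidict_{edition}") is not None


def make_tokenizer(edition: str = "small"):
    """Build a SudachiPy tokenizer for `edition`, failing with a clear message.

    Hypothetical helper: assumes SudachiPy >= 0.6 (the `dict` argument).
    """
    if importlib.util.find_spec("sudachipy") is None:
        raise RuntimeError("sudachipy is not installed: pip install sudachipy")
    if not dictionary_installed(edition):
        raise RuntimeError(
            f"dictionary missing: pip install sudachidict-{edition}"
        )
    # Imported lazily so the availability check above can run on its own.
    from sudachipy import dictionary

    return dictionary.Dictionary(dict=edition).create()
```

Calling make_tokenizer("small") returns a tokenizer when both packages are present. Otherwise it raises a RuntimeError that names the missing pip package, rather than failing deep inside dictionary loading.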

Initialize a SudachiPy tokenizer with the small dictionary and tokenize a sample sentence.

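from sudachipy import dictionary
from sudachipy import tokenizer

# Build a tokenizer backed by the small dictionary (requires sudachidict-small).
# Without dict='small', SudachiPy looks for the core edition instead.
tokenizer_obj = dictionary.Dictionary(dict='small').create()

# Split mode C keeps the longest units; A and B produce shorter ones
mode = tokenizer.Tokenizer.SplitMode.C
morphemes = tokenizer_obj.tokenize('本日は晴天なり', mode)
for m in morphemes:
    # part_of_speech() returns a tuple of part-of-speech fields
    print(f"{m.surface()}\t{m.part_of_speech()}")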
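The quickstart can be extended to compare split modes and print a more compact part-of-speech label. The sketch below is illustrative only. The helpers pos_label and analyze are hypothetical names, and it assumes SudachiPy 0.6 or later with sudachidict-small installed. It also assumes the usual convention that unused part-of-speech fields are filled with '*'.

```python
def pos_label(pos) -> str:
    """Join the informative part-of-speech fields, skipping '*' placeholders."""
    return "-".join(p for p in pos if p != "*")


def analyze(text: str, mode_name: str = "C"):
    """Return (surface, dictionary_form, pos_label) for each morpheme.

    mode_name is "A", "B", or "C", matching tokenizer.Tokenizer.SplitMode.
    """
    # Imported lazily so pos_label stays usable without sudachipy installed.
    from sudachipy import dictionary, tokenizer

    tok = dictionary.Dictionary(dict="small").create()
    mode = getattr(tokenizer.Tokenizer.SplitMode, mode_name)
    return [
        (m.surface(), m.dictionary_form(), pos_label(m.part_of_speech()))
        for m in tok.tokenize(text, mode)
    ]
```

Running analyze('本日は晴天なり', 'A') and analyze('本日は晴天なり', 'C') side by side shows how the split mode changes token granularity. In real use the tokenizer should be created once and reused, since loading the dictionary is the expensive step.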