SudachiPy
SudachiPy is the Python binding for sudachi.rs, a Japanese morphological analyzer implemented in Rust. It provides multi-granular tokenization of Japanese text through three split modes (A: short units, B: middle units, C: named-entity-like units). The current version is 0.6.10, with releases typically occurring every few months to deliver updates, bug fixes, and support for new Python versions.
Warnings
- breaking Support for Python 3.6 and 3.7 was removed in v0.6.4, and support for Python 3.8 was removed in v0.6.9. Ensure you are using Python 3.9 or newer.
- breaking The `sudachipy link` command, used for managing dictionary paths, was removed in v0.5.2. Specify dictionaries instead via the `config_path` or `dict_type` arguments to `Dictionary()`, or via the equivalent CLI options.
- deprecated Direct imports like `from sudachipy.dictionary import Dictionary` and `from sudachipy.tokenizer import Tokenizer` are deprecated. Import `Dictionary`, `Tokenizer`, and `SplitMode` directly from the top-level `sudachipy` package.
- gotcha SudachiPy requires a dictionary package (e.g., `sudachidict_core`) to be installed separately. It is not included in the main `sudachipy` package.
- gotcha Building SudachiPy from source (common on ARM64 Linux/macOS if no pre-built wheel is available) requires the Rust compiler toolchain and `setuptools-rust` to be installed in your environment.
- gotcha The `mode` parameter in the `Tokenizer.tokenize()` method is deprecated. Pass the analysis mode when creating the `Tokenizer` instance, or use `Morpheme.split()` for multi-level splitting.
- gotcha Dictionary resource path resolution logic changed in v0.6.3. Paths are now resolved in a specific order: absolute paths, relative to config `path`, relative to `resource_dir` param, relative to config file, relative to current directory.
Install
- pip
pip install sudachipy sudachidict_core
Imports
- Dictionary
from sudachipy import Dictionary
- Tokenizer
from sudachipy import Tokenizer
- SplitMode
from sudachipy import SplitMode
Quickstart
from sudachipy import Dictionary, SplitMode
# Initialize the tokenizer with the default (core) dictionary
tokenizer = Dictionary().create()
text = "すもももももももものうち"
# Tokenize in SplitMode.C (the coarsest segmentation)
morphemes_c = tokenizer.tokenize(text, SplitMode.C)
print("SplitMode.C:", [m.surface() for m in morphemes_c])
# Tokenize in SplitMode.A (the finest segmentation)
morphemes_a = tokenizer.tokenize(text, SplitMode.A)
print("SplitMode.A:", [m.surface() for m in morphemes_a])
# Access morpheme details
if morphemes_c:
    first_morpheme = morphemes_c[0]
    print(f"\nFirst morpheme (C): {first_morpheme.surface()}")
    print(f"  Reading form: {first_morpheme.reading_form()}")
    print(f"  Part of speech: {first_morpheme.part_of_speech()}")