SudachiPy

0.6.10 · active · verified Sat Apr 11

SudachiPy is the Python binding for Sudachi.rs, a Japanese morphological analyzer implemented in Rust. It tokenizes Japanese text at multiple granularities (split modes A, B, and C, from shortest to longest units) and exposes per-morpheme information such as the reading form and part of speech. The current version is 0.6.10; releases typically appear every few months with updates, bug fixes, and support for newer Python versions.

Warnings

Install

Imports

Quickstart

This quickstart initializes the SudachiPy tokenizer with the default dictionary (which requires the separate sudachidict_core package to be installed), tokenizes the same Japanese text in two split modes, and reads basic information from an individual morpheme.

from sudachipy import Dictionary, SplitMode

# Initialize the tokenizer with the default (core) dictionary
tokenizer = Dictionary().create()

text = "すもももももももものうち"

# Tokenize in SplitMode.C (longest units, e.g. compounds and named entities)
morphemes_c = tokenizer.tokenize(text, SplitMode.C)
print("SplitMode.C:", [m.surface() for m in morphemes_c])

# Tokenize in SplitMode.A (shortest units)
morphemes_a = tokenizer.tokenize(text, SplitMode.A)
print("SplitMode.A:", [m.surface() for m in morphemes_a])

# Access morpheme details
if morphemes_c:
    first_morpheme = morphemes_c[0]
    print(f"\nFirst morpheme (C): {first_morpheme.surface()}")
    print(f"  Reading form: {first_morpheme.reading_form()}")
    print(f"  Part of Speech: {first_morpheme.part_of_speech()}")
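To make the difference between split modes easier to see, the following is a minimal sketch that runs one text through all three modes and collects the surfaces side by side. The helper names `compare_split_modes` and `run_demo` are hypothetical and not part of SudachiPy; the sample text is only an illustrative assumption, and the exact segmentation depends on the installed dictionary version.

```python
from typing import Dict, List


def compare_split_modes(tokenizer, text: str, modes: Dict[str, object]) -> Dict[str, List[str]]:
    """Hypothetical helper: tokenize `text` once per mode and return the surfaces.

    `tokenizer` is anything with a tokenize(text, mode) method whose results
    expose surface(), as the SudachiPy tokenizer does.
    """
    return {
        name: [m.surface() for m in tokenizer.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def run_demo() -> None:
    """Hypothetical demo; assumes sudachipy and sudachidict_core are installed."""
    # Imported here so the helper above can be used without SudachiPy present
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create()
    modes = {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C}
    # A compound noun is a convenient input because the modes tend to split it differently
    for name, surfaces in compare_split_modes(tokenizer, "国家公務員", modes).items():
        print(f"SplitMode.{name}:", surfaces)
```

Calling `run_demo()` prints one line per mode. Mode A should yield the most tokens and mode C the fewest, with mode B in between.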
