{"id":3286,"library":"sudachipy","title":"SudachiPy","description":"SudachiPy is a Python binding for Sudachi.rs, a Japanese morphological analyzer implemented in Rust. It provides multi-granular tokenization of Japanese text through three split modes: A (short units), B (middle units), and C (named-entity units). The current version is 0.6.10, with releases typically occurring every few months to incorporate updates, bug fixes, and Python version support.","status":"active","version":"0.6.10","language":"en","source_language":"en","source_url":"https://github.com/WorksApplications/sudachi.rs/tree/develop/python","tags":["NLP","Japanese","morphological analysis","tokenization","text processing"],"install":[{"cmd":"pip install sudachipy sudachidict_core","lang":"bash","label":"Install SudachiPy and Core Dictionary"}],"dependencies":[{"reason":"Required for dictionary functionality; not bundled with sudachipy.","package":"sudachidict_core","optional":false},{"reason":"Required for building from source, particularly on ARM architectures.","package":"setuptools-rust","optional":true},{"reason":"Rust compiler toolchain required for building from source (e.g., on ARM64 Linux/macOS).","package":"rust","optional":true}],"imports":[{"note":"Importing from sudachipy.dictionary is deprecated as of recent versions.","wrong":"from sudachipy.dictionary import Dictionary","symbol":"Dictionary","correct":"from sudachipy import Dictionary"},{"note":"Importing from sudachipy.tokenizer is deprecated as of recent versions.","wrong":"from sudachipy.tokenizer import Tokenizer","symbol":"Tokenizer","correct":"from sudachipy import Tokenizer"},{"symbol":"SplitMode","correct":"from sudachipy import SplitMode"}],"quickstart":{"code":"from sudachipy import Dictionary, SplitMode\n\n# Initialize the tokenizer with the default (core) dictionary\ntokenizer = Dictionary().create()\n\ntext = \"すもももももももものうち\"\n\n# Tokenize in SplitMode.C (longest units, named-entity level)\nmorphemes_c = tokenizer.tokenize(text, SplitMode.C)\nprint(\"SplitMode.C:\", 
[m.surface() for m in morphemes_c])\n\n# Tokenize in SplitMode.A (shortest units)\nmorphemes_a = tokenizer.tokenize(text, SplitMode.A)\nprint(\"SplitMode.A:\", [m.surface() for m in morphemes_a])\n\n# Access morpheme details\nif morphemes_c:\n    first_morpheme = morphemes_c[0]\n    print(f\"\\nFirst morpheme (C): {first_morpheme.surface()}\")\n    print(f\"  Reading form: {first_morpheme.reading_form()}\")\n    print(f\"  Part of Speech: {first_morpheme.part_of_speech()}\")","lang":"python","description":"This quickstart demonstrates how to initialize the SudachiPy tokenizer with a default dictionary and perform multi-granular tokenization on Japanese text. It also shows how to access basic information for individual morphemes."},"warnings":[{"fix":"Upgrade your Python environment to 3.9 or a later supported version.","message":"Support for Python 3.6, 3.7, and 3.8 was removed across the 0.6.4 and 0.6.9 releases. Ensure you are using Python 3.9 or newer.","severity":"breaking","affected_versions":">=0.6.4, >=0.6.9"},{"fix":"Refer to the documentation for updated methods of specifying dictionary paths or types (e.g., `Dictionary(dict_type='full')`).","message":"The `sudachipy link` command, used for managing dictionary paths, was removed in v0.5.2. Dictionaries are now selected through the `config_path` or `dict_type` arguments to `Dictionary()`, or through CLI options.","severity":"breaking","affected_versions":">=0.5.2"},{"fix":"Change import statements to `from sudachipy import Dictionary, Tokenizer, SplitMode`.","message":"Direct imports like `from sudachipy.dictionary import Dictionary` and `from sudachipy.tokenizer import Tokenizer` are deprecated. 
Import `Dictionary`, `Tokenizer`, and `SplitMode` directly from the top-level `sudachipy` package.","severity":"deprecated","affected_versions":">=0.6.x"},{"fix":"Always install a dictionary package alongside `sudachipy`, e.g., `pip install sudachipy sudachidict_core`.","message":"SudachiPy requires a dictionary package (e.g., `sudachidict_core`) to be installed separately. It is not included in the main `sudachipy` package.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install Rust and `setuptools-rust` (e.g., `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && pip install setuptools-rust`).","message":"Building SudachiPy from source (common on ARM64 Linux/macOS if no pre-built wheel is available) requires the Rust compiler toolchain and `setuptools-rust` to be installed in your environment.","severity":"gotcha","affected_versions":"All versions (when building from source)"},{"fix":"Create different `Tokenizer` instances for different modes, e.g., `tokenizer_a = Dictionary().create(mode=SplitMode.A)`.","message":"The `mode` parameter in the `Tokenizer.tokenize()` method is deprecated. Pass the analysis mode when creating the `Tokenizer` instance, or use `Morpheme.split()` for multi-level splitting.","severity":"deprecated","affected_versions":">=0.6.x"},{"fix":"If custom dictionary paths are used, verify they align with the new resolution order. Consider using the `resource_dir` parameter of the `Dictionary()` constructor or `config_path` for explicit control.","message":"Dictionary resource path resolution logic changed in v0.6.3. Paths are now resolved in this order: absolute paths, relative to config `path`, relative to `resource_dir` param, relative to config file, relative to current directory.","severity":"gotcha","affected_versions":">=0.6.3"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}