Janome: Japanese Morphological Analyzer
Janome is a Japanese morphological analysis engine (tokenizer and POS tagger) written in pure Python, with a built-in dictionary and language model. It aims to be easy to install and provides concise, well-designed APIs for use in Python applications. Janome uses mecab-ipadic-2.7.0-20070801 as its built-in dictionary. The current version is 0.5.0, released in July 2023; major versions have historically shipped roughly 6-18 months apart.
Common errors
- MemoryError: Cannot allocate memory
  cause: During `pip install janome`, compiling the built-in dictionary requires a significant amount of RAM (500-600 MB); insufficient memory leads to this error.
  fix: Ensure your environment has at least 2 GB of free RAM before running `pip install janome`. On a resource-constrained system, consider increasing swap space or installing on a more powerful machine.
- ModuleNotFoundError: No module named 'janome.tokenizer'
  cause: The Janome library is not installed, or the import path for `Tokenizer` is incorrect; the library's main components live in submodules.
  fix: Verify the installation with `pip show janome`; if it is missing, run `pip install janome`. Import `Tokenizer` from `janome.tokenizer` as shown in the quickstart, not directly from `janome`.
- AttributeError: 'str' object has no attribute 'surface'
  cause: This typically occurs when iterating over the output of `tokenize()` with `wakati=True` (word-segmentation mode), which yields plain strings, while trying to access `Token` attributes such as `token.surface` or `token.part_of_speech`.
  fix: If you need `Token` objects with full morphological details, do not pass `wakati=True` to the `tokenize()` method or the `Tokenizer` constructor. If you do want wakati-gaki output (a sequence of strings), process the results as strings, e.g. `for word in t.tokenize(text, wakati=True): print(word)`.
Warnings
- gotcha Installation requires significant RAM (500-600 MB) for dictionary compilation. Systems with limited memory might encounter `MemoryError` during `pip install`.
- gotcha The `Analyzer` module and its filters are considered experimental. Its class/method interfaces may be modified in future releases.
- breaking Versions prior to 0.4.2 had non-deterministic behavior in `Tokenizer` for some inputs, which could lead to inconsistent analysis results.
- breaking Older versions (prior to 0.4.2) could raise a 'Too many open files' error because system dictionary instances were not shared as a singleton, especially in long-running processes or when creating many `Tokenizer` instances.
- gotcha If you only need 'wakati-gaki' (word segmentation) mode, initializing `Tokenizer(wakati=True)` can reduce memory usage by about 50MB as it loads only minimum system dictionary data. If `wakati=True` is passed to the constructor, the `tokenize()` method will *always* operate in `wakati-gaki` mode, ignoring `wakati=False` in the method call.
Install
-
pip install janome
Imports
- Tokenizer
from janome.tokenizer import Tokenizer
- Analyzer
from janome.analyzer import Analyzer
- CharFilters (e.g., UnicodeNormalizeCharFilter)
from janome.charfilter import UnicodeNormalizeCharFilter
- TokenFilters (e.g., CompoundNounFilter)
from janome.tokenfilter import CompoundNounFilter
Quickstart
from janome.tokenizer import Tokenizer
t = Tokenizer()
text = 'すもももももももものうち'
for token in t.tokenize(text):
    print(token)
# Example of 'wakati-gaki' mode (surface forms only)
# tokens_wakati = t.tokenize(text, wakati=True)
# print(list(tokens_wakati))  # tokenize() returns a generator in 0.4.0+