{"id":7318,"library":"janome","title":"Janome: Japanese Morphological Analyzer","description":"Janome is a Japanese morphological analysis engine (tokenizer and POS tagger) written in pure Python, with a built-in dictionary and language model. It aims to be easy to install and to provide concise, well-designed APIs for Python applications. Its built-in dictionary is mecab-ipadic-2.7.0-20070801. The current version is 0.5.0, released in July 2023, with roughly 6 to 18 months between major versions.","status":"active","version":"0.5.0","language":"en","source_language":"en","source_url":"https://github.com/mocobeta/janome","tags":["japanese","nlp","morphological analysis","tokenizer","pos-tagger"],"install":[{"cmd":"pip install janome","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Requires Python 3.7 or newer to run.","package":"Python","optional":false}],"imports":[{"symbol":"Tokenizer","correct":"from janome.tokenizer import Tokenizer"},{"note":"Analyzer is in its own submodule, not directly under the top-level package.","wrong":"from janome import Analyzer","symbol":"Analyzer","correct":"from janome.analyzer import Analyzer"},{"note":"Often imported with `from janome.charfilter import *` for convenience, but specific imports are recommended.","symbol":"CharFilters (e.g., UnicodeNormalizeCharFilter)","correct":"from janome.charfilter import UnicodeNormalizeCharFilter"},{"note":"Often imported with `from janome.tokenfilter import *` for convenience, but specific imports are recommended.","symbol":"TokenFilters (e.g., CompoundNounFilter)","correct":"from janome.tokenfilter import CompoundNounFilter"}],"quickstart":{"code":"from janome.tokenizer import Tokenizer\n\nt = Tokenizer()\ntext = 'すもももももももものうち'\n\nfor token in t.tokenize(text):\n    print(token)\n\n# Example of 'wakati-gaki' mode (surface forms only)\n# tokens_wakati = t.tokenize(text, wakati=True)\n# print(list(tokens_wakati))","lang":"python","description":"Initializes the Tokenizer and processes a Japanese sentence, printing each token with its morphological information. A commented-out example of 'wakati-gaki' (word segmentation) mode, which returns only surface forms, is also included."},"warnings":[{"fix":"Ensure adequate RAM (e.g., 2 GB or more) is available during installation. Versions 0.2.6 and later are better optimized for 32-bit environments.","message":"Installation requires significant RAM (500-600 MB) for dictionary compilation. Systems with limited memory might encounter `MemoryError` during `pip install`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Be aware that code using `Analyzer` might require adjustments in subsequent major versions. Refer to the release notes for API changes.","message":"The `Analyzer` module and its filters are considered experimental. Their class and method interfaces may change in future releases.","severity":"gotcha","affected_versions":"0.3.4 and later"},{"fix":"Upgrade to Janome 0.4.2 or later to ensure deterministic tokenization.","message":"Versions prior to 0.4.2 had non-deterministic behavior in `Tokenizer` for some inputs, which could lead to inconsistent analysis results.","severity":"breaking","affected_versions":"<0.4.2"},{"fix":"Upgrade to Janome 0.4.2 or later, which makes the system dictionary a singleton and so prevents this resource exhaustion.","message":"Versions prior to 0.4.2 could raise a 'Too many open files' error because system dictionary instances were not singletons, especially in long-running processes or when creating many `Tokenizer` instances.","severity":"breaking","affected_versions":"<0.4.2"},{"fix":"To reduce memory usage, use `t = Tokenizer(wakati=True)` if you exclusively need word segmentation. Otherwise, default to `Tokenizer()` and pass `wakati=True` to the `tokenize()` method when needed.","message":"If you only need 'wakati-gaki' (word segmentation) mode, initializing `Tokenizer(wakati=True)` can reduce memory usage by about 50 MB because it loads only the minimum system dictionary data. If `wakati=True` is passed to the constructor, the `tokenize()` method will *always* operate in 'wakati-gaki' mode, ignoring `wakati=False` in the method call.","severity":"gotcha","affected_versions":"0.3.1 and later"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure your environment has at least 2 GB of free RAM before running `pip install janome`. On a resource-constrained system, consider increasing swap space or using a more powerful machine for installation.","cause":"Compiling the built-in dictionary during `pip install janome` requires a significant amount of RAM (500-600 MB). Insufficient memory leads to this error.","error":"MemoryError: Cannot allocate memory"},{"fix":"First, verify the installation with `pip show janome`. If it is not installed, run `pip install janome`. Ensure you are importing `Tokenizer` from `janome.tokenizer` as shown in the quickstart, not directly from `janome`.","cause":"The Janome library is either not installed, or the import path for `Tokenizer` is incorrect. The library's main components reside in submodules.","error":"ModuleNotFoundError: No module named 'janome.tokenizer'"},{"fix":"If you need `Token` objects with full morphological details, do not pass `wakati=True` to the `tokenize()` method or the `Tokenizer` constructor. If you *do* want 'wakati-gaki' output (surface-form strings), process the output as strings. Example: `for word in t.tokenize(text, wakati=True): print(word)`.","cause":"This typically occurs when you iterate over tokens produced with `wakati=True` (word segmentation mode), which yields strings, and then try to access `Token` attributes such as `token.surface` or `token.part_of_speech`.","error":"AttributeError: 'str' object has no attribute 'surface'"}]}