{"id":21435,"library":"hojichar","title":"HojiChar","description":"HojiChar is a text preprocessing management system for Python, providing a pipeline API inspired by Compose/Filter patterns to clean, filter, and transform text data, with built-in support for deduplication, JSON loading/dumping, and asynchronous processing. Current version: 0.16.0, released Apr 2025; follows a monthly release cadence.","status":"active","version":"0.16.0","language":"python","source_language":"en","source_url":"https://github.com/HojiChar/HojiChar","tags":["text preprocessing","NLP","Japanese","pipeline","data cleaning"],"install":[{"cmd":"pip install hojichar","lang":"bash","label":"Default install"}],"dependencies":[{"reason":"Required for Japanese-language filters like DiscardTooManyNouns and WordRepetitionRatioFilter","package":"fugashi[tagger]","optional":true},{"reason":"Rust-based engine used by GenerateDedupLSH for near-duplicate detection (v0.15.0+)","package":"rensa","optional":true}],"imports":[{"note":"Compose is a top-level export since v0.10.0","wrong":"from hojichar.core import Compose","symbol":"Compose","correct":"from hojichar import Compose"},{"note":"Introduced in v0.14.0","wrong":null,"symbol":"AsyncCompose","correct":"from hojichar import AsyncCompose"},{"note":"Part of document_filters module","wrong":null,"symbol":"JSONLoader","correct":"from hojichar.document_filters import JSONLoader"},{"note":"JSONDumper moved to hojichar.document_filters in v0.10.0","wrong":"from hojichar.filters import JSONDumper","symbol":"JSONDumper","correct":"from hojichar.document_filters import JSONDumper"}],"quickstart":{"code":"from hojichar import Compose, document_filters\nfrom hojichar.filters import AcceptJapaneseFilter, DiscardTooManyFilter, DiscardTooFewLettersFilter\n\npipeline = Compose([\n    document_filters.JSONLoader(),\n    AcceptJapaneseFilter(),\n    DiscardTooManyFilter(max_filtered_num=10000),\n    DiscardTooFewLettersFilter(min_letters=10),\n    
document_filters.JSONDumper(export_extras=True),\n])\n\n# Compose processes one document per call, so feed the JSONL file line by line.\nwith open('input.jsonl', 'r', encoding='utf-8') as f:\n    results = [pipeline(line) for line in f if line.strip()]","lang":"python","description":"Build a text preprocessing pipeline that parses each JSONL line, keeps Japanese text, discards documents with too many or too few letters, and outputs JSON lines with metadata."},"warnings":[{"fix":"Use `from hojichar.filters.deduplication import GenerateDedupLSH` and ensure `rensa` is installed (pip install rensa). For in-memory dedup without Rust, use `InlineDeduplicator`.","message":"In v0.15.0, the deduplication module was overhauled. GenerateDedupLSH now uses the Rust-based `rensa` engine. The old `MinHash` and `LSH` classes were removed. If you relied on the previous Python-only implementation, you must update imports and usage.","severity":"breaking","affected_versions":">=0.15.0"},{"fix":"Migrate to `pipeline.stats` or read individual filter stats via `filter.get_stats()`.","message":"Statistics properties on filters (e.g., `.stats`) are deprecated since v0.15.3. Use the new `stats` module or access via pipeline-level statistics.","severity":"deprecated","affected_versions":">=0.15.3,<0.17.0"},{"fix":"Use `JSONDumper(export_extras=True)` and ensure the extras dict does not contain `'__init_stats'` if you want it exported, or override the dumper.","message":"The `JSONDumper` by default excludes extras key `'__init_stats'` from output. If you need that metadata, set `export_extras=True` and explicitly manage the extras dict.","severity":"gotcha","affected_versions":">=0.14.1"},{"fix":"Set `max_parse_chars` parameter (e.g., `DiscardTooManyNouns(max_parse_chars=1000000)`).","message":"Japanese-language filters like `DiscardTooManyNouns` and `WordRepetitionRatioFilter` may cause segfaults on very large texts if `max_parse_chars` is not set. 
Defaults were adjusted in v0.15.1, but if you encounter segmentation faults, explicitly set a limit.","severity":"gotcha","affected_versions":">=0.15.0"}],"env_vars":null,"last_verified":"2026-04-27T00:00:00.000Z","next_check":"2026-07-26T00:00:00.000Z","problems":[{"fix":"Use the top-level public API: `from hojichar import Compose` instead of `from hojichar.core import Compose`.","cause":"Trying to import from old internal module structure that was restructured in v0.10.0.","error":"ModuleNotFoundError: No module named 'hojichar.core'"},{"fix":"Upgrade to hojichar >= 0.14.0: `pip install 'hojichar>=0.14.0'`.","cause":"Using a version older than v0.14.0 where AsyncCompose was introduced.","error":"AttributeError: module 'hojichar' has no attribute 'AsyncCompose'"},{"fix":"Use `from hojichar.document_filters import JSONDumper`.","cause":"JSONDumper was moved to `hojichar.document_filters` in v0.10.0.","error":"ImportError: cannot import name 'JSONDumper' from 'hojichar.filters'"},{"fix":"Set `max_parse_chars` parameter, e.g., `DiscardTooManyNouns(max_parse_chars=500000)`.","cause":"fugashi parsing a very long text without parse length limit can crash.","error":"segfault (Fatal Python error: Segmentation fault) when using DiscardTooManyNouns on large text"}],"ecosystem":"pypi","meta_description":null,"install_score":null,"install_tag":null,"quickstart_score":null,"quickstart_tag":null}