HojiChar

0.16.0 verified Mon Apr 27 auth: no python

HojiChar is a text preprocessing management system for Python. It provides a pipeline API, inspired by Compose/Filter patterns, for cleaning, filtering, and transforming text data, with built-in support for deduplication, JSON loading and dumping, and asynchronous processing. The current version is 0.16.0, released in April 2025; releases follow a monthly cadence.

pip install hojichar
error ModuleNotFoundError: No module named 'hojichar.core'
cause Importing from the old internal module layout, which was restructured in v0.10.0.
fix Use the top-level public API: `from hojichar import Compose` instead of `from hojichar.core import Compose`.
error AttributeError: module 'hojichar' has no attribute 'AsyncCompose'
cause The installed version is older than v0.14.0, the release that introduced `AsyncCompose`.
fix Upgrade to hojichar >= 0.14.0: `pip install 'hojichar>=0.14.0'`.
error ImportError: cannot import name 'JSONDumper' from 'hojichar.filters'
cause `JSONDumper` was moved to `hojichar.document_filters` in v0.10.0.
fix Use `from hojichar.document_filters import JSONDumper`.
error segfault (Fatal Python error: Segmentation fault) when using `DiscardTooManyNouns` on large text
cause fugashi can crash when it parses very long text without a parse length limit.
fix Set the `max_parse_chars` parameter, e.g., `DiscardTooManyNouns(max_parse_chars=500000)`.
breaking In v0.15.0, the deduplication module was overhauled. GenerateDedupLSH now uses the Rust-based `rensa` engine. The old `MinHash` and `LSH` classes were removed. If you relied on the previous Python-only implementation, you must update imports and usage.
fix Use `from hojichar.filters.deduplication import GenerateDedupLSH` and ensure `rensa` is installed (pip install rensa). For in-memory dedup without Rust, use `InlineDeduplicator`.
deprecated Statistics properties on filters (e.g., `.stats`) are deprecated since v0.15.3. Use the new `stats` module or access via pipeline-level statistics.
fix Migrate to using `pipeline.stats` or individual filter stats via `filter.get_stats()`.
gotcha The `JSONDumper` by default excludes extras key `'__init_stats'` from output. If you need that metadata, set `export_extras=True` and explicitly manage the extras dict.
fix Use `JSONDumper(export_extras=True)` to export the extras dict. If `'__init_stats'` must appear in the output, manage that entry in the extras dict yourself, or override the dumper.
gotcha Japanese-language filters like `DiscardTooManyNouns` and `WordRepetitionRatioFilter` may cause segfaults on very large texts if `max_parse_chars` is not set. Defaults were adjusted in v0.15.1, but if you encounter segmentation faults, explicitly set a limit.
fix Set `max_parse_chars` parameter (e.g., `DiscardTooManyNouns(max_parse_chars=1000000)`).
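
Several entries above depend on which release is installed (v0.10.0, v0.14.0, v0.15.0, v0.15.1). A minimal sketch of a version gate, using only the standard library, might look like the following. The helper names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical and not part of HojiChar, and the comparison is deliberately simplified: it only looks at the leading numeric components.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.15.3' into (0, 15, 3); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum` (hypothetical helper)."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str = "hojichar") -> Optional[str]:
    """Look up the installed version, or None when the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("hojichar is not installed")
elif not meets_minimum(version, "0.14.0"):
    print(f"hojichar {version} predates AsyncCompose; upgrade to >= 0.14.0")
else:
    print(f"hojichar {version} is new enough for AsyncCompose")
```

For pre-release or otherwise irregular version strings, `packaging.version.Version` is a more robust comparison than this tuple-based sketch.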

Build a text preprocessing pipeline that reads JSONL, filters Japanese text, discards documents with too many or too few letters, and outputs JSONL with metadata.

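from hojichar import Compose, document_filters

# Filter names follow the HojiChar README; check them against your installed version.
pipeline = Compose([
    document_filters.JSONLoader(),
    document_filters.AcceptJapanese(),  # keep Japanese documents only
    # Discard documents that are too short or too long.
    document_filters.DocumentLengthFilter(min_doc_len=10, max_doc_len=10000),
    document_filters.JSONDumper(export_extras=True),
])

# JSONL holds one JSON document per line, so run the pipeline line by line.
results = []
with open('input.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # skip blank lines
            results.append(pipeline(line))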
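
Because JSONL stores one JSON document per line, a pipeline that parses a single JSON document per call can be driven one line at a time. The sketch below shows that loop using only the standard library. `run_jsonl` and `keep_long_text` are hypothetical names: `keep_long_text` is a stand-in for a real pipeline, not HojiChar's API, and treating an empty string as "discarded" is an assumption of this sketch.

```python
import json
from typing import Callable, Iterable, Iterator


def run_jsonl(lines: Iterable[str], process: Callable[[str], str]) -> Iterator[str]:
    """Apply `process` to each non-blank JSONL line and yield non-empty results."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        result = process(line)
        if result:
            yield result


def keep_long_text(line: str, min_letters: int = 10) -> str:
    """Hypothetical stand-in for a pipeline: drop records whose text is too short."""
    record = json.loads(line)
    if len(record.get("text", "")) < min_letters:
        return ""  # empty string marks a discarded record in this sketch
    return json.dumps(record, ensure_ascii=False)


sample = [
    '{"text": "短い"}',
    "",
    '{"text": "これは十分に長い日本語の文章です。"}',
]
for out in run_jsonl(sample, keep_long_text):
    print(out)
```

With a real pipeline, any callable that maps one input string to one output string can be passed as `process`, and an open file object can be passed as `lines`, so the whole file never has to be held in memory.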