JSON Stream Rust Tokenizer
A faster tokenizer for the `json-stream` Python library: this package ports `json-stream`'s internal tokenizer to Rust using PyO3, providing a significant parsing speedup (4-10x on CPython). As of `json-stream` 2.0, it is detected and used automatically when installed, so explicit installation or usage is generally unnecessary. It supports Python >=3.8,<3.15 and is actively maintained.
Warnings
- gotcha Starting with `json-stream` version 2.0, this tokenizer is used automatically if available. Explicitly installing `json-stream-rs-tokenizer` or passing `RustTokenizer` to `json_stream.load()` is usually unnecessary unless you are on an older `json-stream` version or need a specific configuration.
- breaking Installation from source requires a Rust toolchain. If a prebuilt wheel is not available for your platform and Python version, the installation process will attempt to build from source. A failed Rust build might still report a successful Python package installation, but `RustTokenizer` will not be importable.
- gotcha When installed in editable/development mode, the Rust library might be compiled in debug mode, which can make it *slower* than the pure-Python tokenizer.
- gotcha For PyPy, the performance improvement from `json-stream-rs-tokenizer` is significantly lower (1.0-1.5x) compared to CPython (4-10x).
- gotcha When parsing mixed data with `json_stream.load(..., correct_cursor=True)`, tracking the exact stream position for un-seekable streams incurs a significant performance cost.
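Because a failed Rust build can leave the Python package installed without a working `RustTokenizer` (see the breaking warning above), a defensive import is a simple way to verify the build and fall back gracefully. This is a sketch, not part of the library's API; the `tokenizer` variable name is illustrative:

```python
# Check whether the Rust tokenizer actually built; if not, fall back to
# json-stream's default pure-Python tokenizer by passing no tokenizer.
try:
    from json_stream_rs_tokenizer import RustTokenizer as tokenizer
except ImportError:
    tokenizer = None  # json_stream.load will use its default tokenizer

print("Rust tokenizer available:", tokenizer is not None)
```

Passing `tokenizer=None` versus omitting the argument may differ depending on the `json-stream` version, so when `tokenizer` is `None` it is safest to call `json_stream.load` without the keyword at all.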
Install
pip install json-stream-rs-tokenizer
Imports
- RustTokenizer
from json_stream_rs_tokenizer import RustTokenizer
- load
from json_stream_rs_tokenizer import load
Quickstart
from io import StringIO
from json_stream import load as json_stream_load
from json_stream_rs_tokenizer import RustTokenizer
# Example JSON data
json_buf = StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }')
# Explicitly use the Rust tokenizer with json_stream.load
d = json_stream_load(json_buf, tokenizer=RustTokenizer)
# json-stream objects are transient by default, so iterate the stream once
for k, l in d.items():
    print(f"{k}: {' '.join(str(n) for n in l)}")
# Output:
# a: 1 2 3 4
# b: 5 6 7
# Alternatively, use the convenience wrapper from json_stream_rs_tokenizer
json_buf_alt = StringIO('{ "x": "hello", "y": "world" }')
from json_stream_rs_tokenizer import load as rs_load
data_alt = rs_load(json_buf_alt)
print(f"Value of x: {data_alt['x']}") # Output: Value of x: hello
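With `json-stream` 2.0+, neither import above is strictly needed: plain `json_stream.load` picks up the Rust tokenizer automatically when it is importable. A minimal sketch, guarded so it also runs where `json-stream` is not installed:

```python
from io import StringIO

# With json-stream >= 2.0 the Rust tokenizer is used automatically when
# available, so no tokenizer argument is required here.
try:
    import json_stream
except ImportError:
    json_stream = None  # json-stream not installed in this environment

pairs = []
if json_stream is not None:
    data = json_stream.load(StringIO('{"a": 1, "b": 2}'))
    for key, value in data.items():
        pairs.append((key, value))
    print(pairs)
```

Whether the Rust or pure-Python tokenizer was used is transparent to this code; the gotchas above (debug-mode builds, PyPy) only affect speed, not behavior.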