MUTF-8 Encoder & Decoder
This package provides a pure-Python implementation and an optional C implementation of encoders and decoders for the MUTF-8 and CESU-8 character encodings. MUTF-8 is a variant of UTF-8 encountered mainly in Java Virtual Machine (JVM) contexts. The C extension is substantially faster; if it cannot be built, the package falls back to the pure-Python version. The current version is 1.0.6, released in late 2021, and the project is in a maintenance phase.
Warnings
- gotcha MUTF-8 is a variant of UTF-8 used primarily in Java environments. It differs from standard UTF-8 in two ways: the null character (U+0000) is encoded as the two-byte sequence `0xC0 0x80` instead of the single byte `0x00`, and supplementary characters (code points above U+FFFF) are encoded as two three-byte sequences (one per UTF-16 surrogate) instead of a single four-byte sequence. Python's built-in `utf-8` codec does not handle either case, so using it on MUTF-8 data raises `UnicodeDecodeError` or produces incorrect results.
- gotcha The `mutf8` library provides a C extension for significant performance improvements (20x to 40x faster) over its pure-Python implementation. If a C99-compatible compiler is not available during installation, the library will silently fall back to the slower pure-Python version. This can lead to unexpected performance bottlenecks.
- deprecated Versions of `mutf8` prior to `1.0.3` raised `UnicodeDecodeError` exceptions with less precise and less descriptive messages, which made malformed MUTF-8 input harder to debug.
- breaking Support for Python 3.5 has been dropped in recent versions of `mutf8`. Attempting to install or use newer versions on Python 3.5 will likely fail.
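The two encoding differences described in the first warning can be made concrete with a small standard-library sketch. It does not use `mutf8` at all; `mutf8_encode_sketch` is a hypothetical helper written here only for illustration, and it makes no claim to match the library's internals or its error handling.

```python
def mutf8_encode_sketch(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical helper, not the mutf8 API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes the two-byte sequence C0 80 instead of 00.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each
            # surrogate as its own three-byte sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# Standard UTF-8 versus the MUTF-8 form of the same characters.
print("\x00".encode("utf-8"), mutf8_encode_sketch("\x00"))
print("\U0001F600".encode("utf-8"), mutf8_encode_sketch("\U0001F600"))

# The built-in codec rejects the MUTF-8 null encoding outright.
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"built-in utf-8 codec failed: {exc.reason}")
```

The supplementary character occupies four bytes in standard UTF-8 and six bytes in the MUTF-8 form, which is why the two encodings are not interchangeable.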
Install
pip install mutf8
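Because the fallback to the pure-Python implementation is silent, it can be worth checking after installation whether a compiled module is present. The sketch below assumes the C extension is importable as `mutf8.cmutf8`; that module name is an assumption not stated in this document, so verify it against the installed package. `submodule_available` is a hypothetical helper built only on `importlib`.

```python
import importlib.util


def submodule_available(name: str) -> bool:
    """Return True if the dotted module `name` can be located.

    Note: for a dotted name, find_spec imports the parent package.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself is not installed.
        return False


# Assumption: the compiled extension lives at `mutf8.cmutf8`.
if submodule_available("mutf8.cmutf8"):
    print("compiled mutf8 extension located")
else:
    print("compiled mutf8 extension not located (or mutf8 is not installed)")
```

If the compiled module is missing, installing a C99-compatible compiler and reinstalling the package is the likely remedy, given the build requirement noted in the warnings.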
Imports
- encode_modified_utf8
from mutf8 import encode_modified_utf8
- decode_modified_utf8
from mutf8 import decode_modified_utf8
Quickstart
from mutf8 import encode_modified_utf8, decode_modified_utf8
# A string with a null character, which MUTF-8 handles differently
original_string = "Hello, \u0000 World!"
# Encode the string to MUTF-8 bytes
mutf8_bytes = encode_modified_utf8(original_string)
print(f"Encoded MUTF-8 bytes: {mutf8_bytes!r}")
# Decode the MUTF-8 bytes back to a Python unicode string
decoded_string = decode_modified_utf8(mutf8_bytes)
print(f"Decoded string: {decoded_string!r}")
# Example with a supplementary character (encoded as surrogate pairs in MUTF-8)
sup_char_string = "\U0001F600"
mutf8_sup_char_bytes = encode_modified_utf8(sup_char_string)
print(f"Encoded supplementary char: {mutf8_sup_char_bytes!r}")
decoded_sup_char_string = decode_modified_utf8(mutf8_sup_char_bytes)
print(f"Decoded supplementary char: {decoded_sup_char_string!r}")
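To cross-check what the decoding half of the round trip does, the following standard-library sketch reverses the two MUTF-8 rules by hand. `mutf8_decode_sketch` is a hypothetical helper, not part of `mutf8`, and it is deliberately lenient: unlike a strict decoder, it also accepts raw `0x00` bytes and four-byte UTF-8 sequences.

```python
def mutf8_decode_sketch(data: bytes) -> str:
    """Illustrative, lenient MUTF-8 decoder (hypothetical helper)."""
    # Undo the two-byte null encoding. The byte 0xC0 does not occur
    # anywhere else in valid UTF-8 or MUTF-8 input.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the built-in codec accept the three-byte
    # encoded surrogates, yielding a string of lone surrogate halves.
    halves = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each high/low pair into a
    # single supplementary code point.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(repr(mutf8_decode_sketch(b"Hello, \xc0\x80 World!")))
print(repr(mutf8_decode_sketch(b"\xed\xa0\xbd\xed\xb8\x80")))
```

For real data, `decode_modified_utf8` remains the better choice, since it reports malformed input with a `UnicodeDecodeError` rather than passing it through.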