MUTF-8 Encoder & Decoder

1.0.6 · maintenance · verified Sat Apr 11

This package provides a pure-Python implementation and an optional C extension for encoding and decoding the MUTF-8 and CESU-8 character encodings. MUTF-8 is a variant of UTF-8 encountered primarily in Java Virtual Machine (JVM) contexts. The C extension is significantly faster; if it cannot be built, the package falls back to the pure-Python version. The current version is 1.0.6, released in late 2021, and the project is in a maintenance phase.

Warnings

gotcha MUTF-8 is a specific variant of UTF-8, primarily used in Java environments. It differs from standard UTF-8 in two key ways: the null character (U+0000) is encoded as the two-byte sequence `0xC0 0x80` instead of the single byte `0x00`, and supplementary characters (code points above U+FFFF) are encoded as two three-byte sequences (one per UTF-16 surrogate) instead of a single four-byte sequence. Python's built-in `utf-8` codec follows standard UTF-8, so using it on MUTF-8 data will raise errors or produce incorrect results.
Fix: Always use `mutf8.encode_modified_utf8` and `mutf8.decode_modified_utf8` when working with MUTF-8 encoded data, especially when interfacing with Java systems.
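The two differences described above can be reproduced with the standard library alone. The following is a minimal sketch for illustration only: `encode_mutf8_sketch` is a hypothetical helper, not part of `mutf8` and not how the library is implemented. It shows the byte-level differences and that the built-in `utf-8` codec rejects the MUTF-8 form of U+0000.

```python
def encode_mutf8_sketch(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical helper, not the mutf8 API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes the two-byte sequence C0 80.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each
            # surrogate as its own three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# Null character: one byte in UTF-8, two bytes in MUTF-8.
print("\x00".encode("utf-8"), encode_mutf8_sketch("\x00"))

# Supplementary character: four bytes in UTF-8, six bytes in MUTF-8.
print("\U0001F600".encode("utf-8"), encode_mutf8_sketch("\U0001F600"))

# The built-in utf-8 codec does not accept the MUTF-8 null encoding.
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"utf-8 codec rejected MUTF-8 input: {exc.reason}")
```

For real data, prefer the library functions named in the fix above; the sketch only makes the encoding rules concrete.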
gotcha The `mutf8` library provides a C extension for significant performance improvements (20x to 40x faster) over its pure-Python implementation. If a C99-compatible compiler is not available during installation, the library will silently fall back to the slower pure-Python version. This can lead to unexpected performance bottlenecks.
Fix: Ensure a C99-compatible compiler is installed and available in your environment before installing `mutf8` to leverage the performance benefits of the C extension. Check installation logs for successful C extension compilation.
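Because the fallback is silent, it can help to check at runtime which implementation was imported. The sketch below assumes that functions provided by the C extension appear as built-in function objects, which is typical for CPython extensions but is not something this page confirms for `mutf8`. `is_c_implemented` is a hypothetical helper name.

```python
import types


def is_c_implemented(func) -> bool:
    """Return True if func is a C-level (built-in) function object."""
    return isinstance(func, types.BuiltinFunctionType)


try:
    from mutf8 import decode_modified_utf8
except ImportError:
    print("mutf8 is not installed")
else:
    # Assumption: the C-backed decoder is a built-in function object,
    # while the pure-Python fallback is an ordinary Python function.
    backend = "C extension" if is_c_implemented(decode_modified_utf8) else "pure Python"
    print(f"mutf8 decoder backend: {backend}")
```

A check like this could run once at application start-up so that a missing compiler shows up in logs rather than as a later performance regression.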
deprecated Versions of `mutf8` prior to `1.0.3` raised less precise and less descriptive `UnicodeDecodeError` exceptions, which made malformed MUTF-8 input harder to debug.
Fix: Upgrade to `mutf8` version `1.0.3` or newer to benefit from improved error messages and more accurate error locations in `UnicodeDecodeError`.
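To make use of the error location, the standard `UnicodeDecodeError` attributes `start` and `reason` can be reported when decoding fails. This is a minimal sketch: `describe_decode_error` is a hypothetical helper, the sample bytes are made up, and the exact message `mutf8` produces depends on the installed version. If `mutf8` is not installed, the example falls back to the built-in `utf-8` codec purely to stay runnable.

```python
def describe_decode_error(decoder, data: bytes) -> str:
    """Run decoder(data) and summarize any UnicodeDecodeError it raises."""
    try:
        decoder(data)
    except UnicodeDecodeError as exc:
        # start is the byte offset where decoding failed.
        return f"{exc.reason} at byte offset {exc.start}"
    return "ok"


try:
    from mutf8 import decode_modified_utf8 as decoder
except ImportError:
    # Fallback for demonstration only; this is standard UTF-8, not MUTF-8.
    def decoder(data: bytes) -> str:
        return data.decode("utf-8")

# 0xFF is not a valid lead byte in UTF-8; it is assumed here to be
# invalid for the mutf8 decoder as well.
print(describe_decode_error(decoder, b"ab\xff"))
```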
breaking Support for Python 3.5 has been dropped in recent versions of `mutf8`. Attempting to install or use newer versions on Python 3.5 will likely fail.
Fix: Upgrade your Python environment to version 3.6 or newer to continue using `mutf8`.

Install

Install stable version
```
pip install mutf8
```

Imports

encode_modified_utf8
```
from mutf8 import encode_modified_utf8
```
decode_modified_utf8
```
from mutf8 import decode_modified_utf8
```

Quickstart

This quickstart demonstrates how to encode a Python string (including one with a null character) into MUTF-8 bytes and then decode it back using the `mutf8` library. MUTF-8 handles null characters and supplementary characters differently than standard UTF-8.

```
from mutf8 import encode_modified_utf8, decode_modified_utf8

# A string with a null character, which MUTF-8 handles differently
original_string = "Hello, \u0000 World!"

# Encode the string to MUTF-8 bytes
mutf8_bytes = encode_modified_utf8(original_string)
print(f"Encoded MUTF-8 bytes: {mutf8_bytes!r}")

# Decode the MUTF-8 bytes back to a Python unicode string
decoded_string = decode_modified_utf8(mutf8_bytes)
print(f"Decoded string: {decoded_string!r}")

# Example with a supplementary character (encoded as a surrogate pair in MUTF-8)
sup_char_string = "\U0001F600"
mutf8_sup_char_bytes = encode_modified_utf8(sup_char_string)
print(f"Encoded supplementary char: {mutf8_sup_char_bytes!r}")
decoded_sup_char_string = decode_modified_utf8(mutf8_sup_char_bytes)
print(f"Decoded supplementary char: {decoded_sup_char_string!r}")
```
