MUTF-8 Encoder & Decoder

1.0.6 · maintenance · verified Sat Apr 11

This package provides fast pure-Python and optional C implementations for encoding and decoding the MUTF-8 and CESU-8 character encodings. MUTF-8 is a variant of UTF-8 primarily encountered in Java Virtual Machine (JVM) contexts. The C extension offers significant performance gains; if it cannot be built, the package falls back to the pure-Python version. The current version is 1.0.6, released in late 2021, and the project is in a maintenance phase.

Warnings

Install

Imports

Quickstart

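This quickstart demonstrates how to encode a Python string (including one with a null character) into MUTF-8 bytes and then decode it back using the `mutf8` library. MUTF-8 differs from standard UTF-8 in two ways: the null character U+0000 is encoded as the two bytes 0xC0 0x80 rather than a single 0x00 byte, and supplementary characters (above U+FFFF) are encoded as a UTF-16 surrogate pair, with each surrogate written as three bytes.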

from mutf8 import encode_modified_utf8, decode_modified_utf8

# A string with a null character, which MUTF-8 handles differently
original_string = "Hello, \u0000 World!"

# Encode the string to MUTF-8 bytes
mutf8_bytes = encode_modified_utf8(original_string)
print(f"Encoded MUTF-8 bytes: {mutf8_bytes!r}")

# Decode the MUTF-8 bytes back to a Python str
decoded_string = decode_modified_utf8(mutf8_bytes)
print(f"Decoded string: {decoded_string!r}")

# Example with a supplementary character (encoded as surrogate pairs in MUTF-8)
sup_char_string = "\U0001F600"
mutf8_sup_char_bytes = encode_modified_utf8(sup_char_string)
print(f"Encoded supplementary char: {mutf8_sup_char_bytes!r}")
decoded_sup_char_string = decode_modified_utf8(mutf8_sup_char_bytes)
print(f"Decoded supplementary char: {decoded_sup_char_string!r}")
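
To make the two differences from standard UTF-8 concrete, here is a minimal standard-library sketch of the encoding rules. The function name `sketch_encode_mutf8` is hypothetical and is not part of the `mutf8` API; it only illustrates the byte layout and is not how the package itself is implemented.

```python
def sketch_encode_mutf8(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical helper, not the mutf8 API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 uses the two-byte form C0 80, so the output
            # never contains a raw 0x00 byte.
            out += b"\xc0\x80"
        elif cp < 0x80:
            # ASCII (other than null) is a single byte, as in UTF-8.
            out.append(cp)
        elif cp < 0x800:
            # Two-byte form, identical to UTF-8.
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Three-byte form, identical to UTF-8.
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            # Supplementary character: split into a UTF-16 surrogate
            # pair, then write each surrogate in the three-byte form.
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
    return bytes(out)


# Null becomes C0 80 instead of 00.
print(sketch_encode_mutf8("Hello, \u0000 World!"))  # b'Hello, \xc0\x80 World!'

# U+1F600 takes six bytes here, versus four in standard UTF-8.
print(sketch_encode_mutf8("\U0001F600"))  # b'\xed\xa0\xbd\xed\xb8\x80'
print("\U0001F600".encode("utf-8"))       # b'\xf0\x9f\x98\x80'
```

Because the null character never appears as a raw 0x00 byte, the encoded output can be stored in null-terminated buffers. This is one reason the format appears in JVM class files.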
