Multi-byte String Decoder
mbstrdecoder is a Python library for robustly decoding multi-byte character strings. It is particularly useful when the encoding is unknown or the data may be malformed. It aims to avoid `UnicodeDecodeError` exceptions by trying several decoding strategies, often using the `chardet` library for encoding detection. The current version is 1.1.4; releases are mostly minor ones, driven by changes in Python version support and by bug fixes.
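The idea of trying several decoding strategies can be illustrated with a minimal standard-library sketch. This is not mbstrdecoder's implementation: the helper name `decode_with_fallback` and the default candidate list are assumptions made for illustration, and the `chardet` detection step is omitted.

```python
from typing import Iterable, Optional, Tuple


def decode_with_fallback(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "shift_jis", "latin-1"),
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, codec) for the first candidate codec that decodes `data`.

    Hypothetical helper for illustration only; it is not part of mbstrdecoder.
    """
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            # This codec cannot decode the bytes; try the next candidate.
            continue
    # No candidate codec could decode the input.
    return None, None


# Shift JIS bytes are not valid UTF-8, so the second candidate is used.
text, codec = decode_with_fallback("マルチバイト".encode("shift_jis"))
print(text, codec)
```

The order of the candidates matters: permissive codecs such as `latin-1` accept any byte sequence, so they belong at the end of the list.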
Warnings
- breaking Python 3.7 and 3.8 support was dropped in version 1.1.4. Users on these Python versions must upgrade their Python environment or pin mbstrdecoder to an older version (e.g., <1.1.4).
- gotcha Versions before 1.1.4 had reported issues in which `UnicodeDecodeError` exceptions were not always raised (or propagated) during decoding attempts, which could lead to silent failures or unexpected behavior in error-handling logic.
- gotcha The `chardet` library is a mandatory dependency. While it was optional in very early versions, it became mandatory from v0.8.2 onwards and has specific version requirements (`chardet>=3.0.2,<6.0.0`). Ensure it's installed alongside `mbstrdecoder` to avoid import errors related to encoding detection.
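The version constraints in the warnings above can be checked before installing or upgrading. The following is a rough sketch under stated assumptions: the helper names (`parse_version`, `check_environment`, `installed_chardet_version`) are hypothetical, the thresholds are taken from the warnings above, and only the leading numeric parts of a version string are compared.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '5.2.0' -> (5, 2, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_chardet_version() -> Optional[str]:
    """Return the installed chardet version, or None if it is not installed."""
    try:
        return metadata.version("chardet")
    except metadata.PackageNotFoundError:
        return None


def check_environment(
    python_version: Tuple[int, int], chardet_version: Optional[str]
) -> List[str]:
    """Hypothetical helper: list problems for running mbstrdecoder 1.1.4."""
    problems = []
    if python_version < (3, 9):
        # Python 3.7 and 3.8 support was dropped in 1.1.4.
        problems.append("Python 3.7/3.8 or older: pin mbstrdecoder<1.1.4 or upgrade Python")
    if chardet_version is None:
        problems.append("chardet is not installed")
    elif not ((3, 0, 2) <= parse_version(chardet_version) < (6,)):
        # Range quoted in the warnings above: chardet>=3.0.2,<6.0.0
        problems.append(f"chardet {chardet_version} is outside >=3.0.2,<6.0.0")
    return problems


for problem in check_environment(sys.version_info[:2], installed_chardet_version()):
    print(problem)
```

An empty result means neither warning applies to the current environment. In practice, `pip` enforces the `chardet` range when it installs `mbstrdecoder`, so a check like this is mainly useful for diagnosing environments that were assembled by other means.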
Install
pip install mbstrdecoder
Imports
- MultiByteStrDecoder
from mbstrdecoder import MultiByteStrDecoder
Quickstart
from mbstrdecoder import MultiByteStrDecoder
# Example 1: Decode a UTF-8 byte string, listing UTF-8 as a codec to try first
decoder1 = MultiByteStrDecoder(b"hello\xc2\xa3world", codec_candidates=["utf_8"])
print(f"Decoded (UTF-8 candidate): {decoder1.unicode_str}, Codec: {decoder1.codec}")
# Example 2: Decode a byte string with unknown encoding (the codec is detected automatically)
decoder2 = MultiByteStrDecoder(b"\xa3123.45")  # Not valid UTF-8; another codec will be chosen
print(f"Decoded (auto-detect): {decoder2.unicode_str}, Codec: {decoder2.codec}")
# Example 3: Bytes that are not valid UTF-8 (an encoded surrogate); the decoder falls back to other codecs
decoder3 = MultiByteStrDecoder(b"\xed\xa0\x80some invalid bytes")
print(f"Decoded (fallback): {decoder3.unicode_str}, Codec: {decoder3.codec}")