Charset Normalizer

3.4.6 · active · verified Fri Mar 27

Charset-normalizer is a truly universal character-encoding detector for Python. It detects the encoding of raw bytes or files using a heuristic, non-training-based approach and can optionally identify the spoken language of the content. Every IANA character set name that CPython ships a codec for is supported. The library also ships a `normalizer` CLI tool and a drop-in `detect()` shim to ease migration from Chardet. The current version is 3.4.6 (released March 2026); releases follow Semantic Versioning with a frequent minor/patch cadence.
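To make the "heuristic, non-training-based" idea concrete, here is a toy, stdlib-only sketch of byte-order-mark sniffing, one unambiguous signal an encoding detector can check before any deeper analysis. This is an illustration only, not charset-normalizer's actual implementation; the `sniff_bom` helper and its candidate table are invented for this example.

```python
import codecs
from typing import Optional

# Known BOMs, with 4-byte UTF-32 marks listed before 2-byte UTF-16 marks,
# since the UTF-32-LE BOM (ff fe 00 00) starts with the UTF-16-LE BOM (ff fe).
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(raw: bytes) -> Optional[str]:
    """Return an encoding name if a known BOM prefixes the data, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

sample = codecs.BOM_UTF16_LE + 'Hello, world!'.encode('utf-16-le')
print(sniff_bom(sample))  # utf-16-le
```

Most real-world text carries no BOM, which is why charset-normalizer layers statistical plausibility checks on top of simple signals like this one.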


Install

pip install charset-normalizer

Imports

from charset_normalizer import from_bytes, from_path, detect

Quickstart

Detect encoding of raw bytes, decode the content, and use the Chardet-compatible legacy shim.

from charset_normalizer import from_bytes, from_path, detect

# --- from raw bytes ---
raw = b'\xff\xfe' + 'Hello, world!'.encode('utf-16-le')
results = from_bytes(raw)
best = results.best()
if best is not None:
    print('Encoding:', best.encoding)        # e.g. 'utf_16'
    print('Language:', best.language)        # e.g. 'English' or ''
    print('Decoded :', str(best))            # decoded unicode string
else:
    print('Could not detect encoding (possibly binary data)')

# --- from a file path ---
# results2 = from_path('./data/sample.txt')
# print(str(results2.best()))

# --- Chardet-compatible legacy shim (deprecated but stable) ---
result = detect(raw)
print(result)  # {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
if result['encoding']:
    decoded = raw.decode(result['encoding'])
    print('Legacy decoded:', decoded)
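To see why heuristic scoring (rather than blind trial decoding) is needed, here is a deliberately naive, stdlib-only fallback decoder. It is not part of charset-normalizer; `naive_decode` and its candidate list are invented for this sketch. It stops at the first strict decode success, which misranks permissive single-byte encodings such as latin-1 that decode *any* byte sequence without raising.

```python
# Fixed candidate list, tried in order. A real detector instead scores how
# plausible the decoded text looks (character categories, language statistics).
CANDIDATES = ['utf-8', 'utf-16', 'cp1252']

def naive_decode(raw: bytes):
    """Return (encoding, text) for the first candidate that decodes strictly,
    or (None, None) if every candidate raises UnicodeDecodeError."""
    for enc in CANDIDATES:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None

enc, text = naive_decode(b'\xff\xfe' + 'Hi'.encode('utf-16-le'))
print(enc, text)  # utf-16 Hi
```

Had `latin-1` been in the candidate list, it would have "succeeded" on any input, masking the true encoding; charset-normalizer avoids this trap by ranking decodings by plausibility instead of stopping at the first one that does not raise.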
