Charset Normalizer
Charset-normalizer is a truly universal charset encoding detector for Python. It detects the encoding of raw bytes/files using a heuristic, non-training-based approach and can optionally identify the spoken language of the content. It supports every IANA character set name for which CPython ships a codec. The library also ships a `normalizer` CLI tool and a drop-in `detect()` shim for Chardet migration. Current version is 3.4.6 (released March 2026); releases follow Semantic Versioning with frequent minor/patch cadence.
Warnings
- breaking Class aliases CharsetNormalizerMatch, CharsetNormalizerMatches, CharsetDetector, and CharsetDoctor were removed in 3.0. Code referencing these names will raise ImportError or AttributeError.
- breaking Python 3.6 support was dropped in 3.1.0, and Python 3.5 support was dropped in 2.1.0. Installing 3.x on Python 3.6 is unsupported.
- gotcha detect() is the legacy Chardet-compatible shim and is officially deprecated. It also lowers confidence automatically for small byte samples (3.4.3+), so results on short inputs may differ from Chardet.
- gotcha Feeding truncated or incomplete multi-byte byte sequences (e.g. a partial UTF-16 or UTF-32 file) will likely produce incorrect or empty detection results. The library is not designed for streaming partial payloads.
- gotcha from_bytes/from_path return a CharsetMatches container, not a string or a single result. Calling str() directly on the container gives unexpected output. Always call .best() first, then check for None.
- gotcha The import name uses an underscore (charset_normalizer) but the PyPI/install name uses a hyphen (charset-normalizer). Using import charset-normalizer raises a SyntaxError.
- deprecated Internal module charset_normalizer.assets was moved into charset_normalizer.constant in 3.3.x. Any code importing from charset_normalizer.assets directly will break on 3.3+.
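The CharsetMatches gotcha above is the most common stumbling block, so here is a minimal sketch of the safe access pattern (the sample string is illustrative):

```python
from charset_normalizer import from_bytes

# from_bytes returns a CharsetMatches container, not a string.
matches = from_bytes("naïve café".encode("utf-8"))

# Pick the single best candidate first; it may be None for
# undetectable (e.g. binary) payloads.
best = matches.best()
if best is None:
    print("undetectable (possibly binary)")
else:
    print(best.encoding)  # a Python codec name, e.g. 'utf_8'
    print(str(best))      # str() on a single match is the decoded text
```

Calling `str()` on the container itself would give you the list-like repr of all candidate matches, which is rarely what you want.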
Install
- pip install charset-normalizer
- pip install charset-normalizer -U
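Installing the package also puts the `normalizer` CLI mentioned above on your PATH. A quick sketch of how it can be invoked (the sample file and its contents are made up for illustration):

```shell
# Create a small sample file (hypothetical path).
printf 'héllo wörld\n' > /tmp/cn_sample.txt

# Full JSON-style detection report for the file.
normalizer /tmp/cn_sample.txt

# -m / --minimal prints only the best encoding guess.
normalizer -m /tmp/cn_sample.txt
```

Run `normalizer --help` for the full flag list of the installed version, since options vary slightly across releases.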
Imports
- from_bytes
from charset_normalizer import from_bytes
- from_path
from charset_normalizer import from_path
- from_fp
from charset_normalizer import from_fp
- detect
from charset_normalizer import detect
- is_binary
from charset_normalizer import is_binary
- CharsetMatches
from charset_normalizer.models import CharsetMatches
- CharsetMatch
from charset_normalizer.models import CharsetMatch
Quickstart
from charset_normalizer import from_bytes, from_path, detect
# --- from raw bytes ---
raw = b'\xff\xfe' + 'Hello, world!'.encode('utf-16-le')
results = from_bytes(raw)
best = results.best()
if best is not None:
    print('Encoding:', best.encoding)  # e.g. 'utf_16'
    print('Language:', best.language)  # e.g. 'English' or ''
    print('Decoded :', str(best))      # decoded unicode string
else:
    print('Could not detect encoding (possibly binary data)')
# --- from a file path ---
# results2 = from_path('./data/sample.txt')
# print(str(results2.best()))
# --- Chardet-compatible legacy shim (deprecated but stable) ---
result = detect(raw)
print(result) # {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
if result['encoding']:
    decoded = raw.decode(result['encoding'])
    print('Legacy decoded:', decoded)