Charset Normalizer
3.4.6 · verified Tue May 12 · auth: no · python install: verified · quickstart: verified
Charset-normalizer is a truly universal charset encoding detector for Python. It detects the encoding of raw bytes or files using a heuristic, non-training-based approach and can optionally identify the spoken language of the content. Every IANA character set name supported by CPython's codecs is handled. The library also ships a `normalizer` CLI tool and a drop-in `detect()` shim for migrating from Chardet. The current version is 3.4.6 (released March 2026); releases follow Semantic Versioning with a frequent minor/patch cadence.
pip install charset-normalizer

Common errors
error AttributeError: partially initialized module 'charset_normalizer' has no attribute 'md__mypyc' (most likely due to a circular import) ↓
cause This error typically indicates a corrupted or incomplete installation of `charset-normalizer`, often due to file shadowing, stale `__pycache__` files, or issues within specific build environments like PyInstaller.
fix Reinstall the package cleanly: pip install --force-reinstall charset-normalizer, or with conda: conda install -c conda-forge charset-normalizer after uninstalling any existing version.
error ModuleNotFoundError: No module named 'charset_normalizer' ↓
cause The `charset-normalizer` package is not installed in the active Python environment or is not discoverable in the Python path.
fix Install the package: pip install charset-normalizer or conda install charset-normalizer, depending on your environment.
error normalizer: command not found ↓
cause The `normalizer` CLI tool, which comes with the `charset-normalizer` library, is not found in your system's PATH or was not installed correctly.
fix
Ensure
charset-normalizer is installed in an environment whose scripts directory is in your system's PATH, or run the tool using python -m charset_normalizer. error ImportError: cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant' ↓
cause This usually points to a version incompatibility or a corrupted installation, often occurring when `charset-normalizer` is used alongside other libraries (like `transformers` or `chardet`) that expect a different internal structure or version.
fix Cleanly uninstall both charset-normalizer and any directly dependent libraries (such as chardet, if present), then reinstall them so compatible versions are used.

Warnings
breaking Class aliases CharsetNormalizerMatch, CharsetNormalizerMatches, CharsetDetector, and CharsetDoctor were removed in 3.0. Code referencing these names will raise ImportError or AttributeError. ↓
fix Replace with CharsetMatch and CharsetMatches imported from charset_normalizer.models, or use the top-level from_bytes/from_path functions directly.
breaking Python 3.6 support was dropped in 3.1.0, and Python 3.5 support was dropped in 2.1.0. Installing 3.x on Python 3.6 is unsupported. ↓
fix Pin charset-normalizer<3.1 for Python 3.6, or upgrade the Python interpreter.
gotcha detect() is the legacy Chardet-compatible shim and is officially deprecated. It also lowers confidence automatically for small byte samples (3.4.3+), so results on short inputs may differ from Chardet. ↓
fix Migrate to from_bytes(...).best() for new code. Check best() for None before calling str() or accessing .encoding.
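The migration above can be sketched side by side; the sample sentence is an arbitrary stand-in, not from the library's docs:

```python
from charset_normalizer import from_bytes, detect

payload = "Ceci est un exemple de texte accentué, détecté automatiquement.".encode("utf-8")

# Legacy shim: returns a chardet-style dict (deprecated)
legacy = detect(payload)
print(legacy["encoding"], legacy["confidence"])

# Preferred API: from_bytes returns a CharsetMatches container
best = from_bytes(payload).best()
if best is not None:
    text = str(best)  # decoded unicode string
    print(best.encoding, text)
else:
    text = ""  # detection failed (e.g. binary data)
```

Note the explicit None check on best(); the legacy dict instead signals failure with encoding set to None.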
gotcha Feeding truncated or incomplete multi-byte byte sequences (e.g. a partial UTF-16 or UTF-32 file) will likely produce incorrect or empty detection results. The library is not designed for streaming partial payloads. ↓
fix Always pass the full byte sequence. Do not slice input for 'performance' — the library already samples internally (5 blocks of 512 bytes by default).
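A minimal sketch of why slicing hurts multi-byte encodings (the sentence is a made-up sample): UTF-16 stores two bytes per code unit, so truncating the buffer at an odd offset splits a code unit and can derail detection.

```python
from charset_normalizer import from_bytes

# A UTF-16-LE payload with BOM; every code unit is two bytes
full = "\ufeff" + "Detection needs the complete byte sequence to work reliably."
payload = full.encode("utf-16-le")

# Pass the whole buffer; the library samples internally
best = from_bytes(payload).best()
if best is not None:
    print(best.encoding)

# Slicing to an odd length splits a UTF-16 code unit
truncated = payload[: len(payload) // 2 + 1]
maybe = from_bytes(truncated).best()  # may be None or a wrong guess
```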
gotcha from_bytes/from_path return a CharsetMatches container, not a string or a single result. Calling str() directly on the container gives unexpected output. Always call .best() first, then check for None. ↓
fix Use: result = from_bytes(raw).best(); text = str(result) if result is not None else ''
gotcha The import name uses an underscore (charset_normalizer) but the PyPI/install name uses a hyphen (charset-normalizer). Using import charset-normalizer raises a SyntaxError. ↓
fix Always use: from charset_normalizer import ...
deprecated Internal module charset_normalizer.assets was moved into charset_normalizer.constant in 3.3.x. Any code importing from charset_normalizer.assets directly will break on 3.3+. ↓
fix Do not import internal modules. Use only the public API: from_bytes, from_path, from_fp, detect, is_binary.
Install
pip install charset-normalizer -U

Install compatibility verified last tested: 2026-05-12
python os / libc status wheel install import disk
3.10 alpine (musl) - - 0.27s 18.6M
3.10 alpine (musl) - - 0.26s 18.6M
3.10 slim (glibc) - - 0.16s 19M
3.10 slim (glibc) - - 0.16s 19M
3.11 alpine (musl) - - 0.34s 20.5M
3.11 alpine (musl) - - 0.35s 20.5M
3.11 slim (glibc) - - 0.27s 21M
3.11 slim (glibc) - - 0.28s 21M
3.12 alpine (musl) - - 0.34s 12.4M
3.12 alpine (musl) - - 0.34s 12.4M
3.12 slim (glibc) - - 0.29s 13M
3.12 slim (glibc) - - 0.30s 13M
3.13 alpine (musl) - - 0.36s 12.0M
3.13 alpine (musl) - - 0.33s 12.0M
3.13 slim (glibc) - - 0.33s 12M
3.13 slim (glibc) - - 0.31s 12M
3.9 alpine (musl) - - 0.25s 18.1M
3.9 alpine (musl) - - 0.26s 18.1M
3.9 slim (glibc) - - 0.20s 19M
3.9 slim (glibc) - - 0.20s 19M
Imports
- from_bytes: from charset_normalizer import from_bytes
- from_path: from charset_normalizer import from_path
- from_fp: from charset_normalizer import from_fp
- detect · wrong: import chardet; chardet.detect(...) · correct: from charset_normalizer import detect
- is_binary: from charset_normalizer import is_binary
- CharsetMatches · wrong: from charset_normalizer import CharsetNormalizerMatches · correct: from charset_normalizer.models import CharsetMatches
- CharsetMatch · wrong: from charset_normalizer import CharsetNormalizerMatch · correct: from charset_normalizer.models import CharsetMatch
Quickstart verified last tested: 2026-04-23
from charset_normalizer import from_bytes, from_path, detect
# --- from raw bytes ---
raw = b'\xff\xfe' + 'Hello, world!'.encode('utf-16-le')
results = from_bytes(raw)
best = results.best()
if best is not None:
    print('Encoding:', best.encoding)  # e.g. 'utf_16'
    print('Language:', best.language)  # e.g. 'English' or ''
    print('Decoded :', str(best))      # decoded unicode string
else:
    print('Could not detect encoding (possibly binary data)')
# --- from a file path ---
# results2 = from_path('./data/sample.txt')
# print(str(results2.best()))
# --- Chardet-compatible legacy shim (deprecated but stable) ---
result = detect(raw)
print(result) # {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
if result['encoding']:
    decoded = raw.decode(result['encoding'])
    print('Legacy decoded:', decoded)
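The quickstart covers from_bytes and from_path; from_fp works the same way on any binary file-like object, so an in-memory buffer can stand in for an open file (the German sample text here is an arbitrary choice):

```python
from io import BytesIO
from charset_normalizer import from_fp

# from_fp reads from any binary file-like object, so a BytesIO
# behaves the same as an open('...', 'rb') handle
text = "Längere Sätze mit Umlauten erhöhen die Trefferquote der Erkennung."
buf = BytesIO(text.encode("utf-8"))

best = from_fp(buf).best()
if best is not None:
    print(best.encoding)  # likely 'utf_8' for this payload
    print(str(best))
```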