cChardet - High-speed Universal Character Encoding Detector
cChardet is a high-speed universal character encoding detector for Python, implemented as a C extension. It wraps the `uchardet` library, which is derived from Mozilla's universal charset detector, to detect text encodings quickly. The current stable version is 2.1.7; alpha releases of 2.2.0 indicate ongoing development and support for newer Python versions.
Warnings
- breaking Supported Python versions have changed across releases. Version 2.1.6 dropped Python 2.7. Version 2.1.7 dropped Python 3.5. The 2.2.x alpha releases indicate that support for Python 3.6-3.8 will be dropped in favor of Python 3.10, 3.11, and 3.12. Always check the required Python version for the specific `cchardet` release you intend to use.
- breaking Version 2.0.0 replaced the underlying `uchardet-enhanced` library with `uchardet`. While both are based on Mozilla's chardet, this change could potentially introduce subtle differences in detection results for specific edge cases or less common encodings.
- gotcha `cchardet` is a C extension. If pre-built wheels are not available for your specific operating system, Python version, and architecture, `pip` will attempt to compile it from source. This requires a C compiler (e.g., GCC, Clang, MSVC) to be installed and properly configured on your system.
- gotcha The `detect` function can return a low confidence value when the input is ambiguous, for example when it is short or when several encodings share the same byte patterns. A confidence above roughly 0.9 generally indicates a reliable detection; lower values warrant further validation or a fallback encoding.
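The confidence check described in the last item can be sketched as a small helper. This is a minimal sketch and not part of cchardet: `choose_encoding`, `decode_bytes`, `min_confidence`, and `fallback` are hypothetical names, and the code assumes the result dictionary carries the `encoding` and `confidence` keys used in the Quickstart, either of which may be `None` when nothing was detected.

```python
def choose_encoding(result, min_confidence=0.9, fallback="utf-8"):
    """Return the detected encoding if it looks trustworthy, else the fallback.

    `result` is a dict shaped like the return value of cchardet.detect():
    {"encoding": <str or None>, "confidence": <float or None>}.
    The 0.9 threshold and the UTF-8 fallback are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0  # treat None as zero confidence
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def decode_bytes(data, result, **kwargs):
    """Decode `data` with the chosen encoding, replacing undecodable bytes."""
    return data.decode(choose_encoding(result, **kwargs), errors="replace")
```

With cchardet installed, this could be called as `decode_bytes(data, cchardet.detect(data))`. The threshold is a judgment call; a stricter or looser value may suit a particular corpus better.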
Install
pip install cchardet
Imports
- detect
import cchardet
cchardet.detect(b'some bytes')
- UniversalDetector
from cchardet import UniversalDetector
Quickstart
import cchardet
# Example 1: Detect encoding of a simple byte string
data = 'これは日本語です'.encode('shift_jis')
result = cchardet.detect(data)
print(f"Detected encoding: {result['encoding']}, confidence: {result['confidence']:.2f}")
# Example 2: Using UniversalDetector for streaming data
from cchardet import UniversalDetector
detector = UniversalDetector()
for line in [b'Hello, world!', b'\xcf\x84\xce\xb7\xce\xbd \xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xb5\xce\xbd \xcf\x81\xce\xb1!']:
    detector.feed(line)
# Call close() once all data has been fed, before reading the result
detector.close()
streaming_result = detector.result
print(f"Streaming detected encoding: {streaming_result['encoding']}, confidence: {streaming_result['confidence']:.2f}")
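For larger inputs, the streaming pattern above can be wrapped in a helper that feeds fixed-size chunks and stops early once the detector has made up its mind. This is a sketch under stated assumptions: `iter_chunks` and `detect_chunks` are hypothetical helpers, and the early exit assumes the detector exposes a `done` attribute in the style of chardet's `UniversalDetector`; if that attribute is absent, every chunk is simply fed.

```python
def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a binary file-like object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def detect_chunks(chunks, detector):
    """Feed byte chunks to a detector offering feed(), close() and result.

    Stops early when the detector reports `done` (an assumed attribute;
    when it is missing, all chunks are fed before closing).
    """
    for chunk in chunks:
        detector.feed(chunk)
        if getattr(detector, "done", False):
            break
    detector.close()
    return detector.result
```

With cchardet installed, a call might look like `detect_chunks(iter_chunks(f), UniversalDetector())`, where `f` is a file opened in binary mode. Passing the detector in, rather than constructing it inside the helper, keeps the function usable with any object that follows the same feed/close/result pattern.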