cChardet - High-speed Universal Character Encoding Detector

2.1.7 · active · verified Mon Apr 13

cChardet is a high-speed universal character encoding detector implemented as a C extension for Python. It wraps the `uchardet` library, which derives from Mozilla's universal charset detection code, so detection runs in native code rather than in pure Python. The current stable version is 2.1.7; alpha releases of 2.2.0 indicate ongoing development and support for newer Python versions.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates basic character encoding detection: `cchardet.detect()` for a single byte string, and `UniversalDetector` for data that arrives in chunks. `detect()` returns a dictionary with the keys 'encoding' and 'confidence'; 'encoding' may be `None` when no encoding can be determined, so check it before using the result.

import cchardet
from cchardet import UniversalDetector

# Example 1: Detect the encoding of a single byte string
data = 'これは日本語です'.encode('shift_jis')
result = cchardet.detect(data)
print(f"Detected encoding: {result['encoding']}, confidence: {result['confidence']:.2f}")

# Example 2: Use UniversalDetector for streaming data
chunks = [b'Hello, world!', b'\xcf\x84\xce\xb7\xce\xbd \xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xb5\xce\xbd \xcf\x81\xce\xb1!']

detector = UniversalDetector()
for chunk in chunks:
    detector.feed(chunk)
    if detector.done:  # stop early once the detector is confident
        break
detector.close()  # finalize detection before reading the result
streaming_result = detector.result
print(f"Streaming detected encoding: {streaming_result['encoding']}, confidence: {streaming_result['confidence']:.2f}")
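
In practice, the detection result is usually used to decode the bytes, and the `None` and low-confidence cases need handling. The following is a minimal sketch of one way to do that, not part of cChardet itself: the helper name `decode_with_detection`, the `min_confidence` threshold of 0.5, and the UTF-8 fallback are all illustrative assumptions. The detector is passed in as a parameter and defaults to `cchardet.detect`.

```python
from typing import Callable, Optional


def decode_with_detection(
    raw: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf-8",
    min_confidence: float = 0.5,
) -> str:
    """Decode raw bytes using a detected encoding, with a safe fallback.

    Hypothetical helper for illustration; the threshold and fallback
    encoding are assumptions, not cChardet defaults.
    """
    if detect is None:
        # Imported lazily so a different detector can be injected instead.
        import cchardet
        detect = cchardet.detect

    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Fall back when detection failed or the detector is not confident.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead.
        return raw.decode(fallback, errors="replace")
```

Calling `decode_with_detection(data)` with the Shift_JIS bytes from Example 1 would use `cchardet.detect` directly. Passing a different callable as `detect` makes the helper straightforward to test or to pair with another detector that returns the same dictionary shape.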
