pycld2 Language Detection
pycld2 provides Python bindings to Google Chromium's Compact Language Detection library (CLD2). It supports detection for over 165 languages and aims to consolidate the C++ library and its bindings into a single installable Python package. Version 0.42 was released in March 2025; releases do not follow a regular cadence.
Warnings
- breaking Installation commonly fails with 'Failed building wheel for pycld2' errors, particularly on non-standard architectures (e.g., ARM/aarch64) or Windows, due to missing C/C++ compilers or Python development headers.
- gotcha The `detect()` function strictly requires UTF-8 encoded `bytes` or a `str` as input. Passing bytes encoded in other formats (e.g., Latin-1) will raise a `pycld2.error`.
- gotcha Setting the `debugScoreAsQuads` parameter to `True` in `detect()` can slow detection noticeably, potentially by a factor of 2.
- deprecated The `hintEncoding` parameter in the `detect()` function is currently not working and provides no biasing hint to the detector.
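The UTF-8 requirement noted in the warnings can be met by decoding raw bytes to `str` before detection. The following is a minimal sketch, assuming the source encoding can be guessed from a short candidate list. `to_text` and its `encodings` parameter are hypothetical names, not part of pycld2.

```python
def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes to str by trying each candidate encoding in order.

    Hypothetical helper, not part of pycld2. latin-1 accepts any byte
    sequence, so it only makes sense as the last candidate.
    """
    if isinstance(raw, str):
        return raw
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Bytes that are valid Latin-1 but not valid UTF-8
latin1_bytes = "Où est la bibliothèque ?".encode("latin-1")
text = to_text(latin1_bytes)

try:
    import pycld2 as cld2
except ImportError:
    cld2 = None  # pycld2 not installed; the decoding step still works

if cld2 is not None:
    is_reliable, _, details = cld2.detect(text)
    print(is_reliable, details[0])
```

Decoding first keeps the encoding guess explicit and under the caller's control, rather than relying on the detector to reject malformed input.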
Install
pip install pycld2
# Build prerequisites (C/C++ compiler and Python development headers):
# On Debian/Ubuntu:
sudo apt-get install build-essential python3-dev
# On macOS: Install Xcode Command Line Tools (xcode-select --install)
# On Windows: Install MSVC Build Tools (part of Visual Studio Community)
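After installing, a quick check can confirm whether the package is visible to the current interpreter. This is a sketch using only the standard library. `pycld2_status` is a hypothetical helper name, and finding the package does not by itself prove that the compiled extension loads correctly.

```python
import importlib.util
from importlib import metadata


def pycld2_status():
    """Return a short string describing whether pycld2 is importable."""
    if importlib.util.find_spec("pycld2") is None:
        return "pycld2 is not installed (the wheel build may have failed)"
    try:
        return f"pycld2 {metadata.version('pycld2')} is installed"
    except metadata.PackageNotFoundError:
        return "pycld2 is importable, but no distribution metadata was found"


print(pycld2_status())
```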
Imports
- detect
import pycld2 as cld2
isReliable, textBytesFound, details = cld2.detect(text)
Quickstart
import pycld2 as cld2
# Example 1: Basic detection
text_russian = "а неправильный формат идентификатора дн назад"
isReliable, textBytesFound, details = cld2.detect(text_russian)
print(f"Text: '{text_russian}'")
print(f"Is reliable: {isReliable}")
print(f"Detected language: {details[0][0]} ({details[0][1]})")
print(f"Details: {details}")
print('\n---\n')
# Example 2: Detecting multiple languages and getting vectors
text_mixed = """France is the largest country in Western Europe. A accès aux chiens et aux frontaux qui lui ont été il peut consulter. The quick brown fox jumped over the lazy dog."""
isReliable, textBytesFound, details, vectors = cld2.detect(
    text_mixed,
    returnVectors=True
)
print(f"Text: '{text_mixed}'")
print(f"Is reliable: {isReliable}")
print(f"Detected language (summary): {details[0][0]} ({details[0][1]})")
print(f"Segment language vectors: {vectors}")
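The vectors returned with `returnVectors=True` can be mapped back to the text they cover. The sketch below assumes each vector is a tuple of (byte offset, byte length, language name, language code) and that offsets count bytes of the UTF-8 encoding of the input; verify this against the pycld2 documentation for your version. `segments_from_vectors` is a hypothetical helper, and the sample vectors are hand-written for illustration, not real detector output.

```python
def segments_from_vectors(text, vectors):
    """Map (offset, length, name, code) vectors back to text snippets.

    Assumes offsets and lengths count bytes of the UTF-8 encoding of `text`.
    Hypothetical helper, not part of pycld2.
    """
    data = text.encode("utf-8")
    segments = []
    for offset, length, _name, code in vectors:
        chunk = data[offset:offset + length]
        # errors="replace" guards against a boundary splitting a character
        segments.append((code, chunk.decode("utf-8", errors="replace")))
    return segments


# Hand-written vectors for illustration only; not real detector output.
sample_text = "Hello world. Ça va très bien."
sample_vectors = ((0, 13, "ENGLISH", "en"), (13, 18, "FRENCH", "fr"))
for code, snippet in segments_from_vectors(sample_text, sample_vectors):
    print(code, repr(snippet))
```

Slicing the encoded bytes rather than the `str` matters for non-ASCII text, where character positions and byte positions diverge.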