Language Detection (langdetect)
langdetect is a pure-Python port of Nakatani Shuyo's language-detection library (originally hosted on Google Code) that identifies the language of a given text. It supports 55 languages out of the box and provides both a single best guess and a list of candidate languages with probabilities. The current version is 1.0.9, released in 2021; releases are infrequent and the project is effectively in maintenance mode.
Warnings
- gotcha The library raises `langdetect.lang_detect_exception.LangDetectException` when the text contains no usable features, for example empty strings or input made up only of digits or punctuation. Very short texts usually do not raise; they return a guess that may be unreliable.
- gotcha Detection is non-deterministic: the algorithm uses random sampling, so the same input can yield different results across runs, most noticeably for short or ambiguous texts. Set `DetectorFactory.seed = 0` before the first detection call to get reproducible results.
- gotcha The language profiles used by `langdetect` come from the original language-detection project (the port is based on its 2014 version) and are not actively updated. Results may be less accurate than those of newer language detection libraries, especially for modern slang, domain-specific text, or less common languages and dialects.
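The first two warnings can be handled together in a small wrapper. The following is a minimal sketch, not part of langdetect: `has_letters` and `safe_detect` are hypothetical helper names, and the optional `detector` argument is an assumption added only so the helper can be exercised with a stub. The letter check is a cheap pre-filter and does not replicate the library's own feature extraction.

```python
from typing import Callable, Optional


def has_letters(text: str) -> bool:
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


def safe_detect(
    text: str,
    detector: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return a language code, or None when the input cannot be classified.

    `detector` is a hypothetical injection point (not a langdetect feature);
    when omitted, langdetect's own `detect` is used with a fixed seed.
    """
    # Skip inputs such as "", "   " or "12345" before calling the detector.
    if not has_letters(text):
        return None

    if detector is not None:
        return detector(text)

    # Imported lazily so the helper can be tested without langdetect installed.
    from langdetect import DetectorFactory, LangDetectException, detect

    # A fixed seed makes repeated calls on the same input reproducible.
    DetectorFactory.seed = 0
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features.
        return None
```

Returning `None` instead of letting the exception propagate is a design choice that suits batch processing, where one unclassifiable row should not stop the run.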
Install
- pip install langdetect
Imports
- detect
from langdetect import detect
- detect_langs
from langdetect import detect_langs
- DetectorFactory
from langdetect import DetectorFactory
- LangDetectException
from langdetect import LangDetectException
Quickstart
from langdetect import DetectorFactory, LangDetectException, detect, detect_langs

# Detection is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

text_en = "This is a simple English sentence."
text_fr = "Ceci est une simple phrase française."
text_mixed = "Hallo Welt! This is a mixed text."

try:
    # Detect the primary language
    print(f"'{text_en}' detected as: {detect(text_en)}")
    print(f"'{text_fr}' detected as: {detect(text_fr)}")
    print(f"'{text_mixed}' detected as: {detect(text_mixed)}")  # May vary due to the mix

    # Get a list of detected languages with their probabilities
    print(f"Probabilities for '{text_en}': {[str(lang) for lang in detect_langs(text_en)]}")
    print(f"Probabilities for '{text_fr}': {[str(lang) for lang in detect_langs(text_fr)]}")

    # Input with no usable features (digits only) raises LangDetectException
    text_no_features = "12345"
    print(f"Attempting to detect '{text_no_features}'...")
    print(f"Probabilities for '{text_no_features}': {[str(lang) for lang in detect_langs(text_no_features)]}")
except LangDetectException as e:
    # Raised when the text has no detectable features (empty, digits, punctuation)
    print(f"An error occurred: {e}. This typically happens with empty or non-linguistic input.")
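The objects returned by `detect_langs` expose `lang` and `prob` attributes, so a caller can reject low-confidence guesses instead of trusting the top result blindly. The sketch below assumes that attribute pair; `pick_language` and the 0.9 threshold are hypothetical choices for illustration, not part of the library.

```python
from typing import Any, Iterable, Optional


def pick_language(candidates: Iterable[Any], min_prob: float = 0.9) -> Optional[str]:
    """Return the most probable language code, or None if it is below min_prob.

    Each candidate is expected to expose `.lang` (language code) and
    `.prob` (probability between 0 and 1), as detect_langs results do.
    """
    # max() with a default handles an empty candidate list without raising.
    best = max(candidates, key=lambda c: c.prob, default=None)
    if best is None or best.prob < min_prob:
        return None
    return best.lang
```

With langdetect installed, a call might look like `pick_language(detect_langs(text))`. A suitable threshold depends on the data; mixed-language or very short inputs tend to spread probability across several candidates.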