Language Detection (langdetect)

1.0.9 · maintenance · verified Thu Apr 09

langdetect is a pure Python port of Nakatani Shuyo's language-detection library (originally hosted on Google Code) that identifies the language of a given text. It supports 55 languages and returns either a single best guess or a list of candidate languages with probabilities. The current version is 1.0.9, released in 2021; releases are infrequent and the project is effectively in maintenance mode.

Warnings

gotcha `detect()` and `detect_langs()` raise `langdetect.lang_detect_exception.LangDetectException` when the text has no usable features, such as an empty string or input made up only of digits, punctuation, or whitespace. Very short texts that do contain letters usually return a guess instead of raising, but that guess is often unreliable.
Fix: Wrap `detect()` and `detect_langs()` calls in a `try...except LangDetectException` block, and consider pre-validating input length or content.
gotcha The detection algorithm is non-deterministic: it samples internally, so short or ambiguous texts can produce different results for the same input across runs.
Fix: Set `DetectorFactory.seed = seed_value` (after `from langdetect import DetectorFactory`) before the first detection call to get reproducible results, which matters most for testing and debugging.
gotcha The language profiles used by `langdetect` come from the March 2014 version of the original language-detection library and are not actively updated. Results may be less accurate than those of newer language detection libraries, especially for modern slang, domain-specific text, or less common languages and dialects.
Fix: For applications that need high accuracy or finer language distinctions, evaluate alternatives such as `fasttext` or `cld3` (bindings for Google's newer CLD3), which ship more recent models.

Install

pip install langdetect

Imports

detect
```
from langdetect import detect
```
detect_langs
```
from langdetect import detect_langs
```
DetectorFactory
```
from langdetect import DetectorFactory
```
Set `DetectorFactory.seed` before detecting to get reproducible results for short texts.
LangDetectException
```
from langdetect import LangDetectException
```
Crucial for handling inputs that cannot be reliably detected.

Quickstart

Demonstrates how to detect the primary language of a text and retrieve a list of candidate languages with their probabilities. It also shows error handling for `LangDetectException`, which is raised for inputs with no detectable features, and how to set `DetectorFactory.seed` for reproducible results.

from langdetect import DetectorFactory, LangDetectException, detect, detect_langs

# Fix the seed for reproducible results, especially with short texts where probabilities are close
DetectorFactory.seed = 0

text_en = "This is a simple English sentence."
text_fr = "Ceci est une simple phrase française."
text_mixed = "Hallo Welt! This is a mixed text."

try:
    # Detect the primary language
    print(f"'{text_en}' detected as: {detect(text_en)}")
    print(f"'{text_fr}' detected as: {detect(text_fr)}")
    print(f"'{text_mixed}' detected as: {detect(text_mixed)}") # May vary due to mix

    # Get a list of detected languages with their probabilities
    print(f"Probabilities for '{text_en}': {[str(l) for l in detect_langs(text_en)]}")
    print(f"Probabilities for '{text_fr}': {[str(l) for l in detect_langs(text_fr)]}")

    # Handling text with no detectable features (empty, digits only, or punctuation only)
    text_short_or_invalid = "12345"
    print(f"Attempting to detect '{text_short_or_invalid}'...")
    print(f"Probabilities for '{text_short_or_invalid}': {[str(l) for l in detect_langs(text_short_or_invalid)]}")

except LangDetectException as e:
    # Raised when the text contains no features the detector can use
    print(f"An error occurred: {e}. This happens when the input has no detectable language features.")
