langid.py: Language Identification

1.1.6 · active · verified Mon Apr 13

langid.py is a standalone Language Identification (LangID) tool for Python. It comes pre-trained on 97 languages and is designed for fast, domain-insensitive classification with minimal dependencies. It can be used as a simple Python library or deployed as a web service. The current version is 1.1.6. The original project is stable, and a community-maintained fork, `py3langid`, offers performance optimizations for modern Python 3.

Quickstart

The `langid.classify()` function takes a string and returns a tuple of the predicted ISO 639-1 language code and an unnormalized log-probability score. To get normalized probabilities in the 0-1 range instead, construct a `LanguageIdentifier` instance explicitly with `norm_probs=True`.

import langid

text1 = "This is a sample text in English."
text2 = "Ceci est un exemple de texte en français."
text3 = "Dies ist ein Beispieltext auf Deutsch."

print(f"'{text1}' -> {langid.classify(text1)}")
print(f"'{text2}' -> {langid.classify(text2)}")
print(f"'{text3}' -> {langid.classify(text3)}")

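# To get normalized probabilities (0-1 range), build an identifier from the
# bundled model string with norm_probs=True.
# (The py3langid fork uses from_pickled_model(MODEL_FILE, ...) instead.)
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print(f"Normalized for '{text1}' -> {identifier.classify(text1)}")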
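To make the difference between the two score types concrete, the following is a minimal sketch of how unnormalized log scores can be mapped to the 0-1 range with a softmax. It uses only the standard library and does not call langid.py. The function name `normalize_log_scores` and the example scores are hypothetical, invented for illustration, and the library's own normalization may differ in detail.

```python
import math


def normalize_log_scores(log_scores):
    """Map unnormalized log scores to probabilities that sum to 1 (softmax)."""
    # Subtract the largest score before exponentiating to avoid underflow.
    peak = max(log_scores.values())
    exps = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Hypothetical per-language scores; NOT real langid.py output.
example_scores = {"en": -50.0, "fr": -60.0, "de": -75.0}

probs = normalize_log_scores(example_scores)
best = max(probs, key=probs.get)
print(best, probs[best])
```

Only the differences between scores matter here, which is why a raw log score on its own is hard to interpret as a confidence, while the normalized value can be compared against a threshold.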
