langid.py: Language Identification
langid.py is a standalone Language Identification (LangID) tool for Python. It comes pre-trained on 97 languages and is designed for fast, domain-insensitive classification with minimal dependencies. It can be used as a Python library or deployed as a web service. The current version is 1.1.6. The original project is stable, and a community-driven fork (`py3langid`) provides performance optimizations for modern Python 3.
Warnings
- gotcha By default, `langid.classify()` returns a `(language, score)` tuple whose score is an unnormalized log-probability estimate, not a probability in the 0-1 range. Do not read it directly as a confidence score between 0 and 1.
- breaking The training tools (`LDfeatureselect.py`, `train.py`) that accompany `langid.py` are documented as Python 2 only. Running them under Python 3 results in compatibility errors.
- gotcha For optimal performance when processing large numbers of texts, `langid` should be used as a Python library (importing and calling `classify()`) or as a web service. Repeatedly invoking `langid.py` from the command line for each text incurs significant overhead due to repeated model loading.
- gotcha A modernized fork, `py3langid`, is optimized for Python 3.6+ and offers substantial performance improvements over the original `langid` package (e.g., 25-30x faster model loading, 5-6x faster classification).
- gotcha Passing non-string types or strings with unexpected formatting (e.g., a tuple represented as a string, or unhandled encoding issues from external sources like databases) to `langid.classify()` can lead to incorrect classifications or `KeyError` exceptions.
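The input and overhead warnings above can be handled with a small guard in front of the classifier. The following is a minimal sketch, not part of langid: `classify_safely`, `classify_many`, and `fake_classify` are hypothetical names, and the classifier is passed in as a callable so the example runs without the package installed. In real use, `langid.classify` would be passed in place of `fake_classify`.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Result = Tuple[str, float]  # (language code, score), the shape classify() returns


def classify_safely(
    value: object,
    classify: Callable[[str], Result],
    encoding: str = "utf-8",
) -> Optional[Result]:
    """Classify `value` if it is usable text; return None otherwise."""
    if isinstance(value, bytes):
        # Decode explicitly instead of passing raw bytes through.
        value = value.decode(encoding, errors="replace")
    if not isinstance(value, str):
        # Rejects tuples (e.g. raw database rows), None, numbers, and so on.
        return None
    text = value.strip()
    if not text:
        return None
    return classify(text)


def classify_many(
    values: Iterable[object],
    classify: Callable[[str], Result],
) -> List[Optional[Result]]:
    """Classify many values in one process, so the model is loaded only once."""
    return [classify_safely(v, classify) for v in values]


def fake_classify(text: str) -> Result:
    """Hypothetical stand-in for langid.classify; always answers English."""
    return ("en", -42.0)


# A database row arriving as a tuple is skipped rather than misclassified.
rows = [("Hello world",), "Hello world", b"Hello world", "   ", None]
results = classify_many(rows, fake_classify)
print(results)
```

Returning `None` for unusable input is one possible design; raising `TypeError` may be preferable where bad rows should fail loudly.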
Install
pip install langid
Imports
- langid
import langid
Quickstart
import langid
text1 = "This is a sample text in English."
text2 = "Ceci est un exemple de texte en français."
text3 = "Dies ist ein Beispieltext auf Deutsch."
# classify() returns a (language, score) tuple; the score is an unnormalized log-probability.
print(f"'{text1}' -> {langid.classify(text1)}")
print(f"'{text2}' -> {langid.classify(text2)}")
print(f"'{text3}' -> {langid.classify(text3)}")
# To get normalized probabilities (0-1 range):
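# The original langid package embeds its model as `model` and loads it with from_modelstring();
# MODEL_FILE and from_pickled_model() belong to the py3langid fork.
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)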
print(f"Normalized for '{text1}' -> {identifier.classify(text1)}")
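To show why the default score is not a 0-1 confidence, here is a minimal sketch of turning per-language log-scores into probabilities. It assumes the normalization behaves like a softmax over all candidate languages. The function name `normalize_log_scores` and the scores themselves are hypothetical and are not taken from langid.

```python
import math
from typing import Dict


def normalize_log_scores(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Map unnormalized log-scores to probabilities that sum to 1 (softmax)."""
    # Subtract the largest score first so exp() cannot underflow to all zeros.
    top = max(log_scores.values())
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Hypothetical log-scores for one text; real values depend on the model.
scores = {"en": -54.4, "fr": -60.1, "de": -63.7}
probs = normalize_log_scores(scores)
best = max(probs, key=probs.get)
print(best, probs[best])
```

The raw scores are negative and depend on text length, while the normalized values are comparable across inputs. This is why thresholding is only meaningful on normalized output.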