py3langid
py3langid is an actively maintained fork of the original `langid.py` library for fast, accurate language identification. It targets Python 3, with a modernized codebase and faster execution than the original. The current version is 0.3.0.
Common errors
-
ModuleNotFoundError: No module named 'py3langid'
cause The `py3langid` package is not installed, or the Python environment where it is installed is not active.
fix Install the package with `pip install py3langid` in the active Python environment.
-
Incorrect language detection for specific texts or less common languages.
cause Language identification models are trained on specific datasets and may perform poorly on text with unusual characteristics, code-switching, or languages underrepresented in the training data (e.g., Romanized Indian languages).
fix Review the confidence score returned by `classify()`. For ambiguous cases, use `langid.rank(text)` to see the scores of all candidate languages. To get scores in the 0-1 range, construct a `LanguageIdentifier` with `norm_probs=True`. Highly specific use cases may need a custom-trained model, though the training scripts inherited from `langid.py` are Python 2-only.
-
Classifier returns large negative numbers instead of probabilities in the 0-1 range.
cause By default, `py3langid` returns log-probabilities for performance. These are not normalized to the 0-1 range unless explicitly requested.
fix When initializing `LanguageIdentifier`, pass `norm_probs=True` to enable normalization: `identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)`. The module-level `langid.classify()` does not normalize its scores.
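The relationship between raw scores and 0-1 probabilities can be illustrated without the library. The following is a minimal sketch, assuming the raw scores behave like log-probabilities that a softmax turns into a distribution. The helper `normalize_scores` and the sample values are hypothetical and not part of py3langid, and the result may not match `norm_probs=True` exactly.

```python
import math


def normalize_scores(ranked):
    """Convert (language, log-score) pairs into (language, probability) pairs.

    Hypothetical helper: applies a softmax, subtracting the maximum score
    first so that math.exp() does not underflow on large negative inputs.
    """
    if not ranked:
        return []
    top = max(score for _, score in ranked)
    weights = [(lang, math.exp(score - top)) for lang, score in ranked]
    total = sum(weight for _, weight in weights)
    normalized = [(lang, weight / total) for lang, weight in weights]
    # Most probable language first
    return sorted(normalized, key=lambda pair: pair[1], reverse=True)


# Made-up scores, shaped like a list of (language, score) pairs
example = [("de", -60.1), ("en", -54.4), ("fr", -63.0)]
probs = normalize_scores(example)

for lang, prob in probs:
    print(f"{lang}: {prob:.4f}")
```

A large gap between the first and second entries suggests a confident prediction, while near-equal values suggest the text is ambiguous.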
Warnings
- breaking Support for Python 3.6 and 3.7 was dropped in py3langid v0.3.0. Users on these older Python versions will need to upgrade their Python interpreter or stick to py3langid v0.2.x.
- breaking The default NumPy data type for feature vectors changed from `uint32` to `uint16` in v0.2.0 for performance optimization. While generally transparent, this could affect applications sensitive to exact data types or those comparing results with older versions.
- gotcha The original `langid.py` (and by extension `py3langid`) training scripts remain Python 2-only. Users expecting to retrain models with custom data using the provided tools might encounter compatibility issues with Python 3.
Install
-
pip install py3langid
Imports
- classify
import py3langid as langid
langid.classify(text)
- LanguageIdentifier
from py3langid.langid import LanguageIdentifier, MODEL_FILE
Quickstart
import py3langid as langid
# classify() returns (language code, score); by default the score is a log-probability, not a 0-1 value
text_en = 'This text is in English.'
lang, score = langid.classify(text_en)
print(f"Text: '{text_en}' -> Language: {lang}, Score: {score}")
text_de = 'Dieser Text ist auf Deutsch.'
lang, score = langid.classify(text_de)
print(f"Text: '{text_de}' -> Language: {lang}, Score: {score}")
# Example with probability normalization
from py3langid.langid import LanguageIdentifier, MODEL_FILE
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
text_norm = 'This should be enough text.'
lang_norm, prob_norm = identifier.classify(text_norm)
print(f"Text (normalized): '{text_norm}' -> Language: {lang_norm}, Normalized Probability: {prob_norm}")
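Once probabilities are normalized, a common follow-up is to reject low-confidence predictions. The following is a minimal sketch, assuming a classifier callable that returns a `(language, probability)` pair in the 0-1 range. The helper `classify_or_none`, the stand-in `fake_classify`, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Tuple


def classify_or_none(
    text: str,
    classify: Callable[[str], Tuple[str, float]],
    min_prob: float = 0.9,
) -> Optional[str]:
    """Return the detected language, or None if confidence is below min_prob."""
    lang, prob = classify(text)
    return lang if prob >= min_prob else None


# Stand-in classifier so the sketch runs without loading a model.
# In practice, pass the classify method of an identifier created
# with norm_probs=True.
def fake_classify(text: str) -> Tuple[str, float]:
    return ("en", 0.97) if "English" in text else ("de", 0.41)


confident = classify_or_none("This text is in English.", fake_classify)
uncertain = classify_or_none("ok", fake_classify)
print(confident, uncertain)
```

Passing the classifier as an argument keeps the threshold logic independent of how the identifier was configured. A suitable threshold depends on the text length and domain.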