{"id":5291,"library":"langid","title":"langid.py: Language Identification","description":"langid.py is a standalone Language Identification (LangID) tool for Python. It comes pre-trained on 97 languages and is designed for fast, domain-insensitive classification with minimal dependencies. It can be used as a simple Python library or deployed as a web service. The current version is 1.1.6. The original project is stable, and a community-driven fork (`py3langid`) offers performance optimizations for modern Python 3.","status":"active","version":"1.1.6","language":"en","source_language":"en","source_url":"https://github.com/saffsd/langid.py","tags":["language identification","nlp","text processing","natural language processing"],"install":[{"cmd":"pip install langid","lang":"bash","label":"Install via pip"}],"dependencies":[{"reason":"Required for numerical computations and model handling within the library.","package":"numpy","optional":false}],"imports":[{"symbol":"langid","correct":"import langid"}],"quickstart":{"code":"import langid\n\ntext1 = \"This is a sample text in English.\"\ntext2 = \"Ceci est un exemple de texte en français.\"\ntext3 = \"Dies ist ein Beispieltext auf Deutsch.\"\n\n# classify() returns a (language_code, score) tuple\nprint(f\"'{text1}' -> {langid.classify(text1)}\")\nprint(f\"'{text2}' -> {langid.classify(text2)}\")\nprint(f\"'{text3}' -> {langid.classify(text3)}\")\n\n# To get normalized probabilities (0-1 range), build an identifier\n# from the model string bundled with langid:\nfrom langid.langid import LanguageIdentifier, model\n\nidentifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)\nprint(f\"Normalized for '{text1}' -> {identifier.classify(text1)}\")","lang":"python","description":"The `langid.classify()` function takes a string and returns a tuple containing the predicted ISO 639-1 language code and an unnormalized log-probability estimate. 
For normalized probabilities (0-1), a `LanguageIdentifier` instance must be explicitly configured."},"warnings":[{"fix":"To obtain normalized probabilities (0-1), explicitly initialize a `LanguageIdentifier` instance with `norm_probs=True`. Example: `from langid.langid import LanguageIdentifier, model; identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True); identifier.classify(text)`.","message":"By default, `langid.classify()` returns an unnormalized log-probability estimate, not a standard probability in the 0-1 range. Interpreting this value directly as a confidence score between 0 and 1 is incorrect.","severity":"gotcha","affected_versions":"All versions"},{"fix":"The original `langid` training tools are not Python 3 compatible. To train custom models on Python 3, consider the `py3langid` fork, whose adapted training scripts are noted as untested, or an alternative library.","message":"The training tools (`LDfeatureselect.py`, `train.py`) accompanying `langid.py` are explicitly stated to be Python 2-only. Attempting to use them with Python 3 will result in compatibility errors.","severity":"breaking","affected_versions":"All versions when using training tools"},{"fix":"Import `langid` once within your application (e.g., `import langid`) and then call `langid.classify()` multiple times. For batch processing of files, the library offers a batch mode (`-b` flag) that utilizes multiprocessing.","message":"For optimal performance when processing large numbers of texts, `langid` should be used as a Python library (importing and calling `classify()`) or as a web service. 
Repeatedly invoking `langid.py` from the command line for each text incurs significant overhead due to repeated model loading.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For new Python 3.6+ projects or performance-critical applications, consider using `py3langid`. It's designed as a drop-in replacement: `pip install py3langid` and then `import py3langid as langid`.","message":"A modernized and significantly faster fork, `py3langid`, exists and is optimized for Python 3.6+. It offers substantial performance improvements (e.g., 25-30x faster model loading, 5-6x faster classification) over the original `langid` package.","severity":"gotcha","affected_versions":"All versions of `langid` on Python 3.6+"},{"fix":"Ensure that all input to `langid.classify()` is a clean Unicode string. Always validate and potentially preprocess inputs, especially when sourcing text from databases or other external systems, to prevent unexpected formatting or encoding issues.","message":"Passing non-string types or strings with unexpected formatting (e.g., a tuple represented as a string, or unhandled encoding issues from external sources like databases) to `langid.classify()` can lead to incorrect classifications or `KeyError` exceptions.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}