Lingua Language Detector
Lingua Language Detector is an accurate natural language detection library for Python, suitable for both short text snippets and mixed-language texts. It uses Rust bindings for high performance and low memory consumption, and it supports 75 languages while working fully offline. The current version is 2.2.0, and the project is actively developed, with regular minor and patch releases.
Warnings
- breaking Version 2.0.0 completely replaced the pure Python implementation with Rust bindings, changing the underlying API. Code written for `1.x` versions will likely break and require adaptation.
- breaking Starting with version 2.1.1, support for Python 3.10 and 3.11 was dropped. The library now requires Python 3.12 or newer.
- gotcha Using `LanguageDetectorBuilder.with_low_accuracy_mode()` speeds up detection and reduces memory consumption, but detection accuracy drops significantly for texts shorter than 120 characters.
- gotcha The `detect_multiple_languages_of()` method for mixed-language texts is considered experimental. Its results are highly dependent on the input text and it performs best in high-accuracy mode with longer words.
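One way to work with the low-accuracy trade-off described above is to keep two detectors and route each text by its length. The following is a minimal sketch, not a recommended pattern from the library itself: `SHORT_TEXT_THRESHOLD`, `build_detectors`, and `detect_with_routing` are hypothetical names, and the 120-character cutoff is simply taken from the warning.

```python
from typing import Any, Optional, Tuple

# Assumption: cutoff copied from the warning above, not a tuned value.
SHORT_TEXT_THRESHOLD = 120


def build_detectors() -> Tuple[Any, Any]:
    """Build one high-accuracy and one low-accuracy detector."""
    # Imported lazily so the routing helper can be exercised without lingua.
    from lingua import Language, LanguageDetectorBuilder

    languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
    high = LanguageDetectorBuilder.from_languages(*languages).build()
    low = (
        LanguageDetectorBuilder.from_languages(*languages)
        .with_low_accuracy_mode()
        .build()
    )
    return high, low


def detect_with_routing(text: str, high: Any, low: Any) -> Optional[Any]:
    """Hypothetical helper: short texts go to the high-accuracy detector."""
    detector = high if len(text) < SHORT_TEXT_THRESHOLD else low
    # detect_language_of() may return None if no language is reliably found.
    return detector.detect_language_of(text)
```

Typical use would be `high, low = build_detectors()` once at startup, then `detect_with_routing(text, high, low)` per text. Keeping both detectors alive gives up part of the memory saving, so this only makes sense when throughput on long texts matters more than footprint.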
Install
pip install lingua-language-detector
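Before installing, it can help to confirm that the interpreter meets the minimum Python version given in the warnings. A minimal sketch, assuming the stated minimum of Python 3.12; `MIN_PYTHON` and `meets_requirement` are hypothetical names used only for illustration:

```python
import sys

# Assumption: minimum version copied from the warning above,
# not read from the package metadata.
MIN_PYTHON = (3, 12)


def meets_requirement(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_requirement():
    found = sys.version.split()[0]
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is needed, found {found}")
```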
Imports
- Language
from lingua import Language
- LanguageDetectorBuilder
from lingua import LanguageDetectorBuilder
Quickstart
from lingua import Language, LanguageDetectorBuilder
# Build a detector for specific languages
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
# Detect a single language
text_single = "languages are awesome"
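detected_language_single = detector.detect_language_of(text_single)
# detect_language_of() returns None when no language can be reliably determined
if detected_language_single is not None:
    print(f"Detected language (single): {detected_language_single.name}")
else:
    print("Detected language (single): unknown")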
# Detect multiple languages in mixed text (experimental)
text_mixed = "Hello world, comment ça va? Das ist ein Test."
detected_languages_mixed = detector.detect_multiple_languages_of(text_mixed)
print("Detected languages (mixed):")
for result in detected_languages_mixed:
    # Each result carries the detected language and the character span it covers
    segment = text_mixed[result.start_index:result.end_index]
    print(f" - {result.language.name}: '{segment}' ({result.start_index}-{result.end_index})")
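When a single best guess is not enough, the detector can also report a confidence value per language through `compute_language_confidence_values()`. The sketch below wraps that call in a hypothetical helper, `top_confidences`, which returns the highest-scoring languages; the helper name, the rounding, and the `limit` parameter are illustrative choices, not part of the library.

```python
from typing import Any, List, Tuple


def top_confidences(detector: Any, text: str, limit: int = 3) -> List[Tuple[str, float]]:
    """Return up to `limit` (language name, confidence) pairs, best first."""
    values = detector.compute_language_confidence_values(text)
    # Sort explicitly rather than relying on the order the detector returns.
    ranked = sorted(values, key=lambda cv: cv.value, reverse=True)
    return [(cv.language.name, round(cv.value, 2)) for cv in ranked[:limit]]


def main() -> None:
    # Imported lazily so the helper above can be exercised without lingua.
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN
    ).build()
    for name, value in top_confidences(detector, "languages are awesome"):
        print(f"{name}: {value}")
```

Calling `main()` prints one line per language. The exact values depend on the installed version and the language set, so none are shown here.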