Polyglot


Polyglot is a natural-language processing pipeline aimed at large multilingual applications. Version 16.7.4 (latest) provides tokenization, language detection, named entity recognition, sentiment analysis, and word embeddings for more than 130 languages. Release cadence is irregular; the last update was in April 2021.
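A hedged sketch of the API surface behind the capabilities listed above (attribute names per the polyglot docs; the snippet is guarded so it degrades cleanly when the library or its models are missing):

```python
# The main features are exposed as lazy attributes on polyglot.text.Text.
CAPABILITIES = ["words", "sentences", "language", "entities"]

try:
    from polyglot.text import Text
except ImportError:
    Text = None  # polyglot not installed; see the install notes below

if Text is not None:
    blob = Text("Polyglot supports many languages.")
    for attr in CAPABILITIES:
        try:
            print(attr, "->", getattr(blob, attr))
        except Exception as exc:  # missing models raise LookupError
            print(attr, "-> needs downloaded models:", exc)
```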

pip install polyglot
error ImportError: No module named 'polyglot'
cause Polyglot not installed.
fix
pip install polyglot
error ModuleNotFoundError: No module named 'icu'
cause PyICU not installed or missing system ICU libraries.
fix
Install the system ICU development package (e.g. libicu-dev on Debian/Ubuntu), then run: pip install pyicu
error LookupError: Resource ... not found. Please use the NLTK Downloader to obtain the resource
cause Polyglot uses an NLTK-style resource system; the required models have not been downloaded.
fix
Run: polyglot download <package> (e.g. polyglot download embeddings2.en)
breaking Polyglot requires ICU (pyicu) and CLD2 (pycld2) to be installed. Without them, imports fail or produce cryptic errors.
fix Install the system package libicu-dev, then: pip install pyicu pycld2 (pycld2 builds against its own bundled CLD2 sources).
gotcha Text object's .words, .sentences, and .entities are lazy and raise LookupError if the required models are missing. Run the polyglot downloader to fetch them first.
fix Run: polyglot download embeddings2.en ner2.en
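Models can also be fetched programmatically via the downloader module. A minimal sketch, assuming the "<task>2.<lang>" package-naming convention the downloader uses (verify the exact names with polyglot download for your version); the model_packages helper is hypothetical, added here for illustration:

```python
# `model_packages` is a hypothetical helper for this example, not polyglot API.
def model_packages(lang, tasks=("embeddings2", "ner2")):
    """Build downloader package names for a language code, e.g. 'ner2.en'."""
    return [f"{task}.{lang}" for task in tasks]

try:
    from polyglot.downloader import downloader

    for package in model_packages("en"):
        downloader.download(package)  # safe to re-run; skips if up to date
except ImportError:
    pass  # polyglot not installed; see the install notes above
```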
deprecated Sentiment analysis using the Text.sentiment attribute is deprecated in favor of using the explicit polyglot.sentiment module.
fix Use polyglot.sentiment.SentimentAnalyzer instead.

Basic usage: detect language of a string.

from polyglot.text import Text

text = Text("Hello, world!")
print(text.language)       # detected Language object
print(text.language.code)  # ISO 639 code, e.g. "en"
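For detection without the Text wrapper, polyglot also exposes a Detector class whose result carries a code and a confidence score. A hedged sketch (the describe helper is hypothetical, added for illustration):

```python
# `describe` is a hypothetical helper for this example, not polyglot API.
def describe(language):
    """Format an object exposing .code and .confidence, as Detector returns."""
    return f"{language.code} ({language.confidence:.0f}%)"

try:
    from polyglot.detect import Detector

    detector = Detector("Bonjour tout le monde")
    print(describe(detector.language))  # best guess, e.g. French
except ImportError:
    print("polyglot not installed; see the install notes above")
```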