spaCy Language Detection
spacy-language-detection is a fully customizable language-detection component for spaCy pipelines, designed for spaCy 3.0 and later. It was forked from `spacy-langdetect` to fix outstanding issues and keep pace with modern spaCy versions. The library detects language at both the document and the sentence level. The current version is 0.2.1; releases typically focus on bug fixes and ongoing spaCy compatibility.
Warnings
- breaking For spaCy 3.x, adding custom pipeline components requires using `Language.factory` to register a component factory, then `nlp.add_pipe` with the factory name. Direct instantiation like `nlp.add_pipe(LanguageDetector())` (common in spaCy 2.x and older `spacy-langdetect`) will not work.
- gotcha The underlying `langdetect` library (used by default) is non-deterministic without a seed. For reproducible results, pass a `seed` argument to the `LanguageDetector` constructor.
- breaking Token-level language detection was removed in version 0.2 of `spacy-language-detection` to simplify the component and focus on Doc and Span level detection.
- deprecated This library (`spacy-language-detection`) is a fork of the original `spacy-langdetect` project, created to address compatibility issues with spaCy 3.x and add features like the `seed` argument. The original `spacy-langdetect` is less actively maintained and may not work correctly with newer spaCy versions.
Install
- pip install spacy-language-detection
- python -m spacy download en_core_web_sm
Imports
- LanguageDetector
from spacy_language_detection import LanguageDetector
- Language
from spacy.language import Language
Quickstart
import spacy
from spacy.language import Language
from spacy_language_detection import LanguageDetector
def get_lang_detector(nlp, name):
    return LanguageDetector(seed=42)  # seed for reproducible langdetect results
nlp_model = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp_model.add_pipe('language_detector', last=True)
text = "This is English text. Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque."
doc = nlp_model(text)
print(f"Document language: {doc._.language}")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent} -> {sent._.language}")