spaCy Language Detection

0.2.1 · active · verified Sun Apr 12

spacy-language-detection is a fully customizable language detection component for spaCy pipelines, designed for spaCy 3.0 and later. It was forked from `spacy-langdetect` to address issues in that package and to stay compatible with modern spaCy versions. The library detects language at both the document and the sentence level. The current version is 0.2.1, with releases typically focused on bug fixes and ongoing spaCy compatibility.

Warnings

Install

Imports

Quickstart

This quickstart adds the `spacy-language-detection` component to a spaCy 3.x pipeline. It registers a language detector factory, adds it as the last component in the pipeline, then processes a multilingual text and prints the detected language for the whole document and for each sentence. A spaCy model such as `en_core_web_sm` must be downloaded before running (for example with `python -m spacy download en_core_web_sm`).

import spacy
from spacy.language import Language
from spacy_language_detection import LanguageDetector

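# Factory function: spaCy calls it with the pipeline and the component name.
def get_lang_detector(nlp, name):
    return LanguageDetector(seed=42)  # Fixed seed for reproducible results

nlp_model = spacy.load("en_core_web_sm")
# Register the factory under a name, then append the component to the pipeline.
Language.factory("language_detector", func=get_lang_detector)
nlp_model.add_pipe("language_detector", last=True)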

text = "This is English text. Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque."
doc = nlp_model(text)

print(f"Document language: {doc._.language}")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent} -> {sent._.language}")
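
The values printed above can be post-processed like any other Python data. The sketch below assumes each `._.language` value is a dict with a `language` code and a `score` confidence; that shape is an assumption here, so check it against the installed version. The helper `group_by_language` and the `min_score` threshold are hypothetical names introduced for illustration, and the sample data is hand-written rather than real detector output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, Union

# Assumed result shape: {"language": "<code>", "score": <float>}
Detection = Dict[str, Union[str, float]]


def group_by_language(
    results: Iterable[Tuple[str, Detection]],
    min_score: float = 0.8,
) -> Dict[str, List[str]]:
    """Group texts by detected language; low-confidence ones go under 'uncertain'."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, detection in results:
        lang = str(detection.get("language", "UNKNOWN"))
        score = float(detection.get("score", 0.0))
        # Only trust the language label when the score clears the threshold.
        key = lang if score >= min_score else "uncertain"
        grouped[key].append(text)
    return dict(grouped)


# Hand-written sample values for illustration, not real detector output.
sample = [
    ("This is English text.", {"language": "en", "score": 0.99}),
    ("Er lebt mit seinen Eltern und seiner Schwester in Berlin.",
     {"language": "de", "score": 0.99}),
    ("Hola.", {"language": "es", "score": 0.55}),
]

grouped = group_by_language(sample)
print(grouped)
```

With a real pipeline, the input could be built as `[(sent.text, sent._.language) for sent in doc.sents]`. Short sentences tend to produce less reliable detections, which is why the sketch routes low scores to a separate bucket instead of trusting the label.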
