spaCy: Industrial-strength Natural Language Processing
spaCy is an open-source library for advanced Natural Language Processing (NLP) in Python and Cython. It's designed for industrial-strength production use, providing efficient processing of large volumes of text and featuring state-of-the-art neural network models for tasks like tagging, parsing, and named entity recognition. Currently at version 3.8.13, spaCy maintains an active development cycle with frequent releases addressing compatibility, performance, and new features.
Warnings
- breaking: Supported Python versions change often, even between patch releases. For example, spaCy v3.7 dropped Python 3.6, v3.8.8 dropped Python 3.9, and later 3.8.x releases added Python 3.14. Always check `requires_python` for your specific spaCy version.
- breaking: The migration from Pydantic v1 to v2 broke model loading in some older v3.8.x releases (e.g., v3.8.12) because of changes in how `confection` and `Thinc` handled dependency validation. This was patched in v3.8.13, but similar incompatibilities deep in the dependency tree can recur.
- breaking: Upgrading from spaCy v2.x to v3.x involves significant API changes to the configuration system, the pipeline architecture, and model training. Existing custom components and trained models will likely require migration.
- gotcha: Trained models are separate Python packages that must be downloaded after the core spaCy library is installed. Forgetting to download a model (e.g., `en_core_web_sm`) results in `OSError: [E050] Can't find model` when calling `spacy.load()`.
- gotcha: Installed spaCy models must be compatible with your spaCy library version. Incompatible models can cause unexpected errors or incorrect behavior.
- deprecated: The `spacy project` functionality moved into a standalone library called `Weasel` in spaCy v3.7. The `spacy project` commands still work, but some spaCy-specific configuration keys (`spacy_version`, `check_requirements`) are deprecated.
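Several of the warnings above can be caught before `spacy.load()` is ever called. The following is a minimal sketch using only the standard library; the helper names `module_installed`, `requires_python`, and `preflight` are illustrative and not part of spaCy.

```python
import importlib.util
import sys
from importlib import metadata
from typing import List, Optional


def module_installed(name: str) -> bool:
    """Return True if a top-level package (spaCy or a pipeline) is importable."""
    return importlib.util.find_spec(name) is not None


def requires_python(dist_name: str = "spacy") -> Optional[str]:
    """Return the Requires-Python specifier of an installed distribution, if any."""
    try:
        return metadata.metadata(dist_name).get("Requires-Python")
    except metadata.PackageNotFoundError:
        return None


def preflight(model: str = "en_core_web_sm") -> List[str]:
    """Collect human-readable problems instead of failing at load time."""
    problems = []
    if not module_installed("spacy"):
        problems.append("spaCy is missing: pip install spacy")
    if not module_installed(model):
        problems.append(f"{model} is missing: python -m spacy download {model}")
    return problems


print("Running Python", sys.version.split()[0])
print("spaCy Requires-Python:", requires_python("spacy"))
for problem in preflight():
    print(problem)
```

This only checks that the packages are present. To check that installed pipelines match the installed spaCy version, the `python -m spacy validate` command is the more direct tool.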
Install
- pip install spacy
- python -m spacy download en_core_web_sm
Imports
- spacy
import spacy
- Language
from spacy.language import Language
- Doc
from spacy.tokens import Doc
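As a rough illustration of how these three imports typically fit together in spaCy v3, the sketch below registers a custom pipeline component with `Language.component` and stores its result on a `Doc` extension attribute. The component name `count_alpha` and the attribute `n_alpha` are made up for this example, and `spacy.blank("en")` is used so that no trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# "n_alpha" is a hypothetical extension attribute chosen for this sketch.
if not Doc.has_extension("n_alpha"):
    Doc.set_extension("n_alpha", default=0)


@Language.component("count_alpha")  # illustrative component name
def count_alpha(doc: Doc) -> Doc:
    """Store the number of alphabetic tokens on the Doc and pass it on."""
    doc._.n_alpha = sum(1 for token in doc if token.is_alpha)
    return doc


# A blank pipeline contains only the tokenizer, so no download is required.
nlp = spacy.blank("en")
nlp.add_pipe("count_alpha")

doc = nlp("Hello world 42")
print(nlp.pipe_names, doc._.n_alpha)
```

Components are added by their registered string name rather than by passing the function object, which is one of the v3 changes that affects code migrated from v2.x.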
Quickstart
import spacy
# Load a pre-trained English pipeline
# Make sure to run `python -m spacy download en_core_web_sm` first
nlp = spacy.load("en_core_web_sm")
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
# Iterate over tokens
for token in doc:
    print(f"{token.text:<15} {token.lemma_:<10} {token.pos_:<10} {token.dep_:<10} {token.ent_type_:<10}")
# Access named entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
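For more than a handful of texts, calling `nlp()` once per string is usually slower than streaming them through `nlp.pipe()`, which processes texts in batches. The sketch below shows that pattern. The `load_pipeline` helper and its fallback to a blank English pipeline are assumptions made so the example runs even without `en_core_web_sm`; with the blank fallback, `doc.ents` is simply empty.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Load a trained pipeline, or fall back to a tokenizer-only one."""
    try:
        return spacy.load(name)
    except OSError:
        # Assumption for this sketch: tokenization alone is an acceptable fallback.
        return spacy.blank("en")


nlp = load_pipeline()

texts = [
    "Apple is looking at buying U.K. startup for $1 billion.",
    "spaCy is written in Python and Cython.",
    "Trained pipelines are installed as separate packages.",
]

# nlp.pipe yields Doc objects in the same order as the input texts.
docs = list(nlp.pipe(texts, batch_size=2))
for doc in docs:
    print(len(doc), [(ent.text, ent.label_) for ent in doc.ents])
```

The `batch_size` value here is arbitrary; a suitable size depends on text length and available memory.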