spaCy: Industrial-strength Natural Language Processing

version 3.8.13 · verified Tue May 12 · auth: no · python install: verified

spaCy is an open-source library for advanced Natural Language Processing (NLP) in Python and Cython. It's designed for industrial-strength production use, providing efficient processing of large volumes of text and featuring state-of-the-art neural network models for tasks like tagging, parsing, and named entity recognition. Currently at version 3.8.13, spaCy maintains an active development cycle with frequent releases addressing compatibility, performance, and new features.

pip install spacy
error ModuleNotFoundError: No module named 'spacy'
cause The spaCy library is not installed in the current Python environment.
fix Install spaCy with pip: pip install spacy.
error AttributeError: module 'spacy' has no attribute 'load'
cause The script is named 'spacy.py', so Python imports the script itself instead of the installed spaCy package.
fix Rename the script (e.g., 'my_script.py') and remove any stale 'spacy.pyc' or '__pycache__' entries left behind.
error OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package, or a valid path to a data directory.
cause The specified spaCy language model is not installed.
fix Download the model: python -m spacy download en_core_web_sm.
error AttributeError: module 'msgpack._unpacker' has no attribute 'unpack'
cause An incompatible version of the 'msgpack' library is installed.
fix Upgrade 'msgpack' to a compatible version: pip install --upgrade msgpack.
error AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
cause An incompatible PyTorch version is installed, breaking spaCy's GPU support.
fix Install compatible versions of spaCy and PyTorch, and make sure the PyTorch build matches your installed CUDA version.
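The two model-related failures (the missing library and the missing model package) can be handled defensively at load time. A minimal sketch, assuming the small English model from the quickstart; `load_pipeline` is an illustrative helper name, and the imports are deferred so this file stays importable even before spaCy is installed:

```python
def load_pipeline(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading the model package on first use."""
    import spacy  # deferred: raises ModuleNotFoundError only when actually called

    try:
        return spacy.load(name)
    except OSError:
        # [E050]: the model is not installed as a package yet -- fetch it once.
        from spacy.cli import download
        download(name)
        return spacy.load(name)
```

Automatic download is convenient for notebooks and demos; in production images, prefer installing models at build time so startup stays deterministic.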
breaking Python version support changes frequently. For example, spaCy v3.8.8 dropped Python 3.9 support, v3.7 dropped Python 3.6, and later 3.8.x releases added Python 3.14 support. Always check `requires_python` for your specific spaCy version.
fix Ensure your Python environment meets the `requires_python` specification for your installed spaCy version. Upgrade or downgrade Python if necessary.
breaking Migration from Pydantic v1 to v2 caused issues with model loading in older v3.8.x releases (e.g., v3.8.12). This was due to changes in how `confection` and `Thinc` handled dependency validation. While patched in v3.8.13, similar deep dependency incompatibilities can recur.
fix Always install the latest patch release (e.g., 3.8.13 for the Pydantic v2 issue). Pin major dependency versions (e.g., Pydantic) to avoid unexpected upgrades, or upgrade spaCy and all its dependencies in a clean virtual environment.
breaking Upgrading from spaCy v2.x to v3.x involves significant API changes, including the configuration system, pipeline architecture, and how models are trained. Existing custom components and trained models will likely require migration.
fix Consult the official spaCy v2.x to v3.x migration guide. Retrain custom models with the new spaCy version and update code to use the new API and configuration patterns.
gotcha Trained models are separate Python packages that must be downloaded after the core spaCy library is installed. Forgetting to download a model (e.g., `en_core_web_sm`) will result in `OSError: [E050] Can't find model` errors when calling `spacy.load()`.
fix After `pip install spacy`, always run `python -m spacy download [model_name]` for the models you intend to use (e.g., `en_core_web_sm`).
gotcha Installed spaCy models must be compatible with your spaCy library version. Incompatible models can lead to unexpected errors or incorrect behavior.
fix After upgrading spaCy, run `python -m spacy validate` to check compatibility of installed models and get recommendations for updates. Retrain any custom models.
deprecated The `spacy project` functionality was moved into a new standalone library called `Weasel` in spaCy v3.7. While `spacy project` commands still work, some spaCy-specific configuration keys (`spacy_version`, `check_requirements`) are deprecated.
fix For new projects, consider using `Weasel` directly if applicable. Use the `WEASEL_CONFIG_OVERRIDES` environment variable in place of the deprecated `SPACY_CONFIG_OVERRIDES`.
python -m spacy download en_core_web_sm
python  os / libc      status       wheel install  import  disk
3.9     alpine (musl)  build_error  -              -       -
3.9     alpine (musl)  -            -              -       -
3.9     slim (glibc)   build_error  -              4.1s    -
3.9     slim (glibc)   -            -              -       -
3.10    alpine (musl)  wheel        -              2.66s   279.5M
3.10    alpine (musl)  -            -              2.71s   279.2M
3.10    slim (glibc)   wheel        13.2s          2.07s   291M
3.10    slim (glibc)   -            -              2.02s   290M
3.11    alpine (musl)  wheel        -              3.49s   298.0M
3.11    alpine (musl)  -            -              3.90s   297.6M
3.11    slim (glibc)   wheel        12.3s          3.20s   311M
3.11    slim (glibc)   -            -              2.96s   311M
3.12    alpine (musl)  wheel        -              3.39s   294.5M
3.12    alpine (musl)  -            -              3.60s   294.0M
3.12    slim (glibc)   wheel        11.3s          3.44s   310M
3.12    slim (glibc)   -            -              3.74s   310M
3.13    alpine (musl)  wheel        -              2.88s   293.3M
3.13    alpine (musl)  -            -              2.96s   292.7M
3.13    slim (glibc)   wheel        11.2s          2.89s   308M
3.13    slim (glibc)   -            -              2.98s   307M

This quickstart demonstrates how to load a pre-trained English language model and use it to process text. It shows tokenization, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. Remember that models must be downloaded separately after installing the spaCy library.

import spacy

# Load a pre-trained English pipeline
# Make sure to run `python -m spacy download en_core_web_sm` first
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Iterate over tokens
for token in doc:
    print(f"{token.text:<15} {token.lemma_:<10} {token.pos_:<10} {token.dep_:<10} {token.ent_type_:<10}")

# Access named entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")