textacy
v0.13.0 · verified Fri May 01
textacy is a Python library for NLP pre- and post-processing, built on top of spaCy. Version 0.13.0 requires Python >=3.9. Releases are irregular, with a focus on text extraction, tokenization, similarity, and topic modeling.
pip install textacy

Common errors
error ModuleNotFoundError: No module named 'textacy.extract' ↓
cause In textacy 0.13.0, extraction helpers live in the `textacy.extract` subpackage (e.g. `textacy.extract.ngrams`); an older installed version lays the package out differently.
fix
Upgrade textacy, then import from the subpackage:
from textacy.extract import ngrams
error TypeError: make_spacy_doc() missing 1 required positional argument: 'lang' ↓
cause `make_spacy_doc()` was called with only the text; the `lang` argument (a loaded spaCy pipeline or a model-name string) is required.
fix
Load a model first:
nlp = spacy.load('en_core_web_sm'), then make_spacy_doc(text, lang=nlp).
error AttributeError: 'str' object has no attribute 'noun_chunks' ↓
cause noun_chunks was called on a raw string; it is an attribute of a parsed spaCy `Doc`, not of `str`.
fix
Process the text into a Doc first:
doc = nlp(text); chunks = list(doc.noun_chunks).

Warnings
breaking textacy 0.13.0 removed many old top-level API functions; functionality now lives in dedicated subpackages such as `textacy.extract` and `textacy.preprocessing`. ↓
fix Replace old imports, e.g. `from textacy import extract`, then call `extract.ngrams(doc, 2)`.
deprecated The old `textacy.preprocess` module was removed in favor of the `textacy.preprocessing` subpackage. ↓
fix Use `textacy.preprocessing` functions, e.g. `preprocessing.normalize.whitespace(text)` or `preprocessing.replace.urls(text)`.
gotcha `make_spacy_doc` requires a `lang` argument: either a loaded spaCy `Language` pipeline or a model-name string. Omitting it raises a TypeError. ↓
fix Pass a pipeline or a name: `make_spacy_doc(text, lang=nlp)` or `make_spacy_doc(text, lang='en_core_web_sm')`.
gotcha `make_spacy_doc` returns a plain `spacy.tokens.Doc`; textacy 0.13.0 has no separate document class. Use the spaCy Doc API plus `textacy.extract` functions directly. ↓
fix Create a Doc first: `doc = textacy.make_spacy_doc(text, lang=nlp)`, then e.g. `textacy.extract.noun_chunks(doc)`.
Install

pip install textacy

Imports
- make_spacy_doc: wrong `from textacy.doc import Doc` · correct `from textacy import make_spacy_doc`
- Corpus: `from textacy import Corpus`
Quickstart
import spacy
import textacy
import textacy.extract
# Load spaCy model
nlp = spacy.load('en_core_web_sm')
# Create a spaCy Doc from text
text = "The quick brown fox jumps over the lazy dog."
doc = textacy.make_spacy_doc(text, lang=nlp) # returns spacy.tokens.Doc
# Extract noun chunks
chunks = list(textacy.extract.noun_chunks(doc))
print(chunks[:2])