Keyphrase Vectorizers

A set of vectorizers that extract keyphrases using part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. Current version 0.0.13; requires Python >=3.7 and spaCy. Releases are intermittent.
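The resulting document-keyphrase matrix has the same shape as a scikit-learn document-term matrix: one row per document, one column per extracted phrase. A minimal sketch with plain scikit-learn (which this package builds on) illustrates that shape; here a fixed `ngram_range` stands in for the POS-matched keyphrases the real vectorizers produce:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "natural language processing enables computers",
    "machine learning is a subset of artificial intelligence",
]

# Plain n-gram counting: KeyphraseCountVectorizer replaces this fixed
# ngram_range with part-of-speech patterns, but the resulting matrix
# has the same (n_documents, n_phrases) layout.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)  # one row per document, one column per n-gram
```

The keyphrase vectorizers differ only in how the columns are chosen, not in the matrix format they return.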

pip install keyphrase-vectorizers
error ModuleNotFoundError: No module named 'keyphrase_vectorizers'
cause The package is not installed or installed under a different name.
fix
Run pip install keyphrase-vectorizers (note the hyphens; the import name uses underscores: `keyphrase_vectorizers`).
error ValueError: The 'spacy_pipeline' parameter must be a string or spacy Language object.
cause Passed an unsupported type (e.g., integer) as `spacy_pipeline`.
fix
Pass either a preloaded spaCy Language object or a valid spaCy model name string, e.g., 'en_core_web_sm'.
error OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
cause The required spaCy model is not installed.
fix
Run python -m spacy download en_core_web_sm (or the equivalent for your language).
gotcha The `spacy_pipeline` parameter accepts either a spaCy Language object (from spacy.load) or a model name string (e.g., 'en_core_web_sm'). Passing a string causes spaCy to reload the pipeline for each new vectorizer, which hurts performance when several vectorizers are created.
fix Always load the model once and pass the nlp object: `nlp = spacy.load('en_core_web_sm'); vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)`
deprecated The parameter `multiprocessing` was renamed to `workers` in v0.0.6. Using `multiprocessing` will raise a TypeError.
fix Use `workers` instead of `multiprocessing` when specifying the number of parallel processes.
breaking In v0.0.9, the default exclusion of certain spaCy pipeline components was removed. This can slow down keyphrase extraction but ensures compatibility with all spaCy pipelines, especially transformer-based ones.
fix If performance degrades, explicitly disable unnecessary pipeline components via the `spacy_exclude` parameter.

Basic usage: load spaCy model, create vectorizer, fit on documents, and inspect extracted keyphrases.

import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Download spaCy model if not already present
# spacy.cli.download('en_core_web_sm')
nlp = spacy.load('en_core_web_sm')

docs = [
    "Natural language processing enables computers to understand human language.",
    "Machine learning is a subset of artificial intelligence."
]

vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())