Keyphrase Vectorizers

A set of vectorizers that extract keyphrases using part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. Current version 0.0.13; requires Python >=3.7 and spaCy. Releases are intermittent.
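The resulting document-keyphrase matrix has the same shape as a scikit-learn document-term matrix: one row per document, one column per extracted phrase. A minimal sketch with plain scikit-learn (which this package builds on) illustrates that shape; here a fixed `ngram_range` stands in for the POS-matched keyphrases the real vectorizers produce:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "natural language processing enables computers",
    "machine learning is a subset of artificial intelligence",
]

# Plain n-gram counting: KeyphraseCountVectorizer replaces this fixed
# ngram_range with part-of-speech patterns, but the resulting matrix
# has the same (n_documents, n_phrases) layout.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)  # one row per document, one column per n-gram
```

The keyphrase vectorizers differ only in how the columns are chosen, not in the matrix format they return.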

pip install keyphrase-vectorizers
error ModuleNotFoundError: No module named 'keyphrase_vectorizers'
cause The package is not installed or installed under a different name.
fix
Run pip install keyphrase-vectorizers (note the hyphens; the import name uses underscores: `keyphrase_vectorizers`).
error ValueError: The 'spacy_pipeline' parameter must be a string or spacy Language object.
cause Passed an unsupported type (e.g., integer) as `spacy_pipeline`.
fix
Pass either a preloaded spaCy Language object or a valid spaCy model name string, e.g., 'en_core_web_sm'.
error OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
cause The required spaCy model is not installed.
fix
Run python -m spacy download en_core_web_sm (or the equivalent for your language).
gotcha The `spacy_pipeline` parameter accepts either a spaCy Language object (from spacy.load) or a model name string (e.g., 'en_core_web_sm'). Passing a string causes spaCy to reload the pipeline for each new vectorizer, which hurts performance when several vectorizers are created.
fix Always load the model once and pass the nlp object: `nlp = spacy.load('en_core_web_sm'); vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)`
deprecated The parameter `multiprocessing` was renamed to `workers` in v0.0.6. Using `multiprocessing` will raise a TypeError.
fix Use `workers` instead of `multiprocessing` when specifying the number of parallel processes.
breaking In v0.0.9, the default exclusion of certain spaCy pipeline components was removed. This can slow down keyphrase extraction but ensures compatibility with all spaCy pipelines, especially transformer-based ones.
fix If performance degrades, explicitly disable unnecessary pipeline components via the `spacy_exclude` parameter.

Basic usage: load spaCy model, create vectorizer, fit on documents, and inspect extracted keyphrases.

import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Download spaCy model if not already present
# spacy.cli.download('en_core_web_sm')
nlp = spacy.load('en_core_web_sm')

docs = [
    "Natural language processing enables computers to understand human language.",
    "Machine learning is a subset of artificial intelligence."
]

vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())