Keyphrase Vectorizers
0.0.13 verified Fri May 01 auth: no python
A set of vectorizers that extract keyphrases matching part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. The current version is 0.0.13; it requires Python >=3.7 and spaCy. Releases are intermittent.
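To make the idea of a document-keyphrase matrix concrete, here is a minimal toy sketch in plain Python. It is not the library's implementation: the tokens are hand-tagged rather than tagged by spaCy, the pattern (zero or more adjectives followed by one or more nouns) is an assumption chosen for illustration, and `extract_keyphrases` and `document_keyphrase_matrix` are hypothetical helper names.

```python
from typing import List, Tuple

# A toy document is a list of (token, part-of-speech tag) pairs.
# The tags are written by hand here; a real pipeline would get them from spaCy.
TaggedDoc = List[Tuple[str, str]]


def extract_keyphrases(doc: TaggedDoc) -> List[str]:
    """Return maximal runs of ADJ* NOUN+ as lowercase phrases (assumed pattern)."""
    phrases: List[str] = []
    run: List[str] = []
    has_noun = False
    # The sentinel token at the end flushes the final run.
    for token, pos in doc + [("", "END")]:
        if pos == "NOUN":
            run.append(token)
            has_noun = True
        elif pos == "ADJ" and not has_noun:
            run.append(token)
        else:
            if has_noun:
                phrases.append(" ".join(run).lower())
            # An adjective that follows a noun starts a new candidate run.
            run = [token] if pos == "ADJ" else []
            has_noun = False
    return phrases


def document_keyphrase_matrix(docs: List[TaggedDoc]) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted keyphrase vocabulary and a per-document count matrix."""
    per_doc = [extract_keyphrases(doc) for doc in docs]
    vocab = sorted({phrase for phrases in per_doc for phrase in phrases})
    matrix = [[phrases.count(phrase) for phrase in vocab] for phrases in per_doc]
    return vocab, matrix


docs = [
    [("Natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
     ("enables", "VERB"), ("computers", "NOUN")],
    [("Machine", "NOUN"), ("learning", "NOUN"), ("is", "AUX"),
     ("a", "DET"), ("subset", "NOUN")],
]
vocab, matrix = document_keyphrase_matrix(docs)
print(vocab)
print(matrix)
```

Each row of the matrix corresponds to one document and each column to one keyphrase, which is the same layout the vectorizers in this package are described as producing.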
pip install keyphrase-vectorizers
Common errors
error ModuleNotFoundError: No module named 'keyphrase_vectorizers'
cause The package is not installed or installed under a different name.
fix Run `pip install keyphrase-vectorizers` to install.
error ValueError: The 'spacy_pipeline' parameter must be a string or spacy Language object.
cause Passed an unsupported type (e.g., integer) as `spacy_pipeline`.
fix Pass either a preloaded spaCy Language object or a valid spaCy model name string, e.g., 'en_core_web_sm'.
error OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
cause The required spaCy model is not installed.
fix Run `python -m spacy download en_core_web_sm` (or the equivalent for your language).
Warnings
gotcha The `spacy_pipeline` parameter accepts either a spaCy Language object (from `spacy.load`) or a model name string (e.g., 'en_core_web_sm'). Passing a string causes the pipeline to be loaded again each time it is needed, which hurts performance.
fix Always load the model once and pass the nlp object: `nlp = spacy.load('en_core_web_sm'); vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)`
deprecated The parameter `multiprocessing` was renamed to `workers` in v0.0.6. Using `multiprocessing` will raise a TypeError.
fix Use `workers` instead of `multiprocessing` when specifying the number of parallel processes.
breaking In v0.0.9, the default exclusion of certain spaCy pipeline components was removed. This can slow down keyphrase extraction but ensures compatibility with all spaCy pipelines, especially transformer-based ones.
fix If performance degrades, explicitly disable unnecessary pipeline components via the `spacy_exclude` parameter.
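When several vectorizers or modules need the same pipeline, one way to follow the "load once" advice from the first warning is to cache the loaded object. The sketch below is only an illustration under stated assumptions: `make_cached_loader` is a hypothetical helper, not part of keyphrase-vectorizers or spaCy, and it takes the loading function as an argument so that the example runs without either library installed.

```python
from typing import Any, Callable, Dict, List


def make_cached_loader(load: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap a loader so that each model name is loaded at most once."""
    cache: Dict[str, Any] = {}

    def get(name: str) -> Any:
        if name not in cache:
            cache[name] = load(name)  # the expensive call happens only here
        return cache[name]

    return get


# Stand-in loader so the sketch runs anywhere; in real use, pass spacy.load.
calls: List[str] = []


def fake_load(name: str) -> dict:
    calls.append(name)
    return {"model": name}


get_nlp = make_cached_loader(fake_load)
first = get_nlp("en_core_web_sm")
second = get_nlp("en_core_web_sm")
print(first is second, len(calls))  # True 1
```

In real use this would become `get_nlp = make_cached_loader(spacy.load)`, with the result of `get_nlp('en_core_web_sm')` passed as `spacy_pipeline` to each vectorizer.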
Imports
- KeyphraseCountVectorizer
wrong `from keyphrase_vectorizers.KeyphraseVectorizers import KeyphraseCountVectorizer`
correct `from keyphrase_vectorizers import KeyphraseCountVectorizer`
- KeyphraseTfidfVectorizer
wrong `from keyphrase_vectorizers.KeyphraseVectorizer import KeyphraseTfidfVectorizer`
correct `from keyphrase_vectorizers import KeyphraseTfidfVectorizer`
Quickstart
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Download the spaCy model if it is not already present
# spacy.cli.download('en_core_web_sm')

# Load the pipeline once and reuse it
nlp = spacy.load('en_core_web_sm')

docs = [
    "Natural language processing enables computers to understand human language.",
    "Machine learning is a subset of artificial intelligence."
]

# Extract keyphrases and build the document-keyphrase count matrix
vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # extracted keyphrases (matrix columns)
print(X.toarray())  # one row per document
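The ModuleNotFoundError and OSError [E050] entries under Common errors both come from a missing installation, so they can be detected before running the quickstart. The following is a minimal sketch using only the standard library. It assumes the spaCy model was installed as a regular Python package (which is what `python -m spacy download` does); `REQUIREMENTS` and `missing_requirements` are hypothetical names introduced for this example.

```python
import importlib.util
from typing import Dict, List

# Import name -> command that installs it (both commands come from Common errors).
REQUIREMENTS: Dict[str, str] = {
    "keyphrase_vectorizers": "pip install keyphrase-vectorizers",
    "en_core_web_sm": "python -m spacy download en_core_web_sm",
}


def missing_requirements(requirements: Dict[str, str]) -> List[str]:
    """Return the install command for every top-level module that cannot be found."""
    return [
        command
        for module, command in requirements.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    for command in missing_requirements(REQUIREMENTS):
        print(f"Missing dependency, try: {command}")
```

Note that the check uses the import name `keyphrase_vectorizers` (underscore), while the pip package name is `keyphrase-vectorizers` (hyphen).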