KeyBERT
KeyBERT is a minimal and easy-to-use Python library for keyword extraction that leverages state-of-the-art BERT embeddings to identify keywords and keyphrases most similar to a given document. Currently at version 0.9.0, it maintains an active release cadence with frequent updates improving performance, adding new features like LLM integration, and extending model backend support.
Warnings
- breaking Support for Python versions 3.6 and 3.7 was dropped in KeyBERT version 0.8.5. Users on older Python versions must upgrade to Python 3.8 or newer.
- breaking KeyBERT's `KeyLLM` integration with the OpenAI API required updates for `openai>=1`. Older `openai` library versions (e.g., pre-1.0) are incompatible.
- gotcha For large datasets, a GPU is highly recommended. Passing multiple documents in a single `extract_keywords` call also speeds up inference substantially, since candidate words are embedded only once.
- gotcha By default, KeyBERT ranks candidates by cosine similarity, which can return near-duplicate keywords. For more diverse results, use a diversification technique such as Max Sum Distance (`use_maxsum=True`) or Maximal Marginal Relevance (`use_mmr=True`).
- gotcha KeyBERT generally doesn't require extensive text preprocessing due to BERT's contextual understanding. However, noisy data (e.g., HTML tags) can negatively impact results.
Install
-
pip install keybert
-
pip install keybert[flair] keybert[gensim] keybert[spacy] keybert[use] keybert[hf]
-
pip install keybert --no-deps scikit-learn model2vec
Imports
- KeyBERT
from keybert import KeyBERT
- KeyLLM
from keybert import KeyLLM
- OpenAI
from keybert.llm import OpenAI
Quickstart
from keybert import KeyBERT
doc = """
Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs.
It infers a function from labeled training data consisting of a set of
training examples. In supervised learning, each example is a pair
consisting of an input object (typically a vector) and a desired
output value (also called the supervisory signal).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, top_n=5)
print(keywords)
# Example with diversification (Maximal Marginal Relevance)
keywords_mmr = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    use_mmr=True,
    diversity=0.7,
    top_n=5,
)
print(keywords_mmr)