BERTopic
BERTopic is a topic modeling technique that leverages transformer-based embeddings (e.g., BERT) and a class-based TF-IDF (c-TF-IDF) procedure to create dense clusters, yielding easily interpretable topics while keeping important words in the topic descriptions. It is currently at version 0.17.4 and actively maintained with regular updates and feature enhancements.
Warnings
- breaking BERTopic models are not guaranteed to be compatible across different versions. When saving and loading models, ensure that the BERTopic version, Python version, and dependency versions (e.g., UMAP, HDBSCAN) are identical to prevent errors or unexpected behavior.
- gotcha The default UMAP algorithm used for dimensionality reduction has a stochastic nature, meaning repeated runs with the same data can yield slightly different topic results.
- gotcha For optimal performance, especially with large documents or when iterating over parameters, it is recommended to pre-calculate embeddings and pass them to BERTopic.
- gotcha By default, BERTopic initializes with an English-optimized embedding model ('all-MiniLM-L6-v2'). For multilingual datasets, you must explicitly specify the language.
- deprecated Support for Python 3.8 was officially dropped with recent BERTopic versions, notably around 0.17.x releases.
Install
- pip install bertopic
- pip install bertopic[flair,gensim,spacy,use]
- pip install bertopic[vision]
Imports
- BERTopic
from bertopic import BERTopic
Quickstart
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Fetch documents (e.g., 20 newsgroups dataset)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Initialize and train BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Get information about the frequent topics
topic_info = topic_model.get_topic_info()
print(topic_info.head())