BERTopic

0.17.4 · active · verified Wed Apr 15

BERTopic is a topic modeling technique that uses transformer embeddings (such as BERT) and a class-based TF-IDF procedure (c-TF-IDF) to form dense clusters of documents, producing easily interpretable topics whose descriptions retain their most important words. The current version is 0.17.4, and the project is actively maintained.
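The class-based TF-IDF step mentioned above can be illustrated with a small pure-Python sketch. It is not BERTopic's own implementation. It assumes the commonly documented weighting, term frequency within a topic multiplied by log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The function names (`c_tf_idf`, `top_terms`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Score terms per topic. docs_per_topic maps topic id -> list of tokenized docs."""
    # Treat all documents of a topic as one large document and count its terms.
    tf = {topic: Counter(tok for doc in docs for tok in doc)
          for topic, docs in docs_per_topic.items()}
    # f_t: frequency of each term across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)
    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        # Normalized term frequency times the class-based IDF term.
        scores[topic] = {term: (count / n_words) * math.log(1 + avg_words / total[term])
                         for term, count in counts.items()}
    return scores

def top_terms(scores, topic, k=3):
    """Return the k highest-scoring terms of a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy corpus: two topics, already clustered and tokenized.
toy = {
    0: [["gpu", "driver", "crash"], ["gpu", "memory", "the"]],
    1: [["goal", "match", "the"], ["match", "team", "goal"]],
}
scores = c_tf_idf(toy)
print(top_terms(scores, 0, k=1))
print(top_terms(scores, 1, k=2))
```

Words that occur in every topic, such as "the" in the toy data, receive a lower weight than words concentrated in one topic, which is why the resulting topic descriptions stay readable.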

Warnings

Install

Imports

Quickstart

This quickstart initializes BERTopic, fits it to a dataset (the 20 newsgroups dataset is a common choice), and retrieves information about the discovered topics. The `fit_transform` method processes the documents and returns a topic assignment for each document along with the associated probabilities.

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch documents (e.g., 20 newsgroups dataset)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Initialize and train BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Get a summary table of the discovered topics (one row per topic)
topic_info = topic_model.get_topic_info()
print(topic_info.head())
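The `topics` list returned by `fit_transform` holds one topic id per document, and with the default clustering BERTopic assigns outlier documents to topic -1. The sketch below shows one way to summarize such a list by hand, similar in spirit to the per-topic counts reported by `get_topic_info()`. The helper `summarize_assignments` and the sample list are hypothetical, so the example runs without training a model.

```python
from collections import Counter

def summarize_assignments(topics):
    """Return (topic sizes sorted by count, share of documents labeled as outliers)."""
    counts = Counter(topics)
    # Topic -1 collects documents that were not assigned to any cluster.
    n_outliers = counts.pop(-1, 0)
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return counts.most_common(), outlier_share

# Hypothetical assignments standing in for the `topics` output of fit_transform.
sample_topics = [0, 1, 0, -1, 2, 0, 1, -1]
sizes, outlier_share = summarize_assignments(sample_topics)
print(sizes)          # (topic id, document count) pairs, largest topic first
print(outlier_share)  # fraction of documents in topic -1
```

A high outlier share on real data may suggest that the clustering settings are worth revisiting, although what counts as high depends on the corpus.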
