{"id":6545,"library":"bertopic","title":"BERTopic","description":"BERTopic is a topic modeling technique that leverages state-of-the-art transformer models (like BERT) and a class-based TF-IDF procedure to create dense clusters, resulting in easily interpretable topics while retaining important words in their descriptions. The current version is 0.17.4, and the project is actively maintained.","status":"active","version":"0.17.4","language":"en","source_language":"en","source_url":"https://github.com/MaartenGr/BERTopic.git","tags":["topic-modeling","nlp","transformers","machine-learning","unsupervised-learning","bert"],"install":[{"cmd":"pip install bertopic","lang":"bash","label":"Base Installation"},{"cmd":"pip install bertopic[flair,gensim,spacy,use]","lang":"bash","label":"Install with specific embedding backends"},{"cmd":"pip install bertopic[vision]","lang":"bash","label":"Install for topic modeling with images"}],"dependencies":[{"reason":"Default for embedding documents.","package":"sentence-transformers"},{"reason":"Default for dimensionality reduction.","package":"umap-learn"},{"reason":"Default for clustering.","package":"hdbscan"},{"reason":"Provides alternative dimensionality reduction (e.g., PCA) for lightweight setups where umap-learn is not installed.","package":"scikit-learn","optional":true}],"imports":[{"symbol":"BERTopic","correct":"from bertopic import BERTopic"}],"quickstart":{"code":"from bertopic import BERTopic\nfrom sklearn.datasets import fetch_20newsgroups\n\n# Fetch documents (e.g., 20 newsgroups dataset)\ndocs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']\n\n# Initialize and train BERTopic model\ntopic_model = BERTopic()\ntopics, probs = topic_model.fit_transform(docs)\n\n# Get information about the frequent topics\ntopic_info = topic_model.get_topic_info()\nprint(topic_info.head())","lang":"python","description":"This quickstart example demonstrates how to initialize 
BERTopic, fit it to a dataset (here, the 20 newsgroups dataset), and retrieve information about the discovered topics. The `fit_transform` method processes the documents and returns a topic assignment and a probability for each document."},"warnings":[{"fix":"Pin the versions of `bertopic` and its core dependencies (`umap-learn`, `hdbscan`, `sentence-transformers`) in your `requirements.txt` or `pyproject.toml` to match the environment where the model was trained.","message":"BERTopic models are not guaranteed to be compatible across versions. When saving and loading models, use the same BERTopic version, Python version, and dependency versions (e.g., UMAP, HDBSCAN) to prevent errors or unexpected behavior.","severity":"breaking","affected_versions":"<0.17.0"},{"fix":"For reproducible results, create UMAP with a fixed `random_state` and pass it to BERTopic. A custom UMAP instance replaces BERTopic's default configuration, so restate those defaults explicitly: `from umap import UMAP; umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42); topic_model = BERTopic(umap_model=umap_model)`.","message":"UMAP, the default dimensionality reduction algorithm, is stochastic, so repeated runs on the same data can yield slightly different topics.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Calculate embeddings once using `sentence-transformers` (or your preferred embedding model), then pass them via the `embeddings` argument: `topic_model.fit_transform(docs, embeddings=precomputed_embeddings)`.","message":"When working with large datasets or iterating over parameters, pre-calculate the embeddings once and pass them to BERTopic so they are not recomputed on every run.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For multilingual documents, initialize BERTopic with `topic_model = BERTopic(language=\"multilingual\")`. This loads a multilingual model ('paraphrase-multilingual-MiniLM-L12-v2').","message":"By default, BERTopic initializes with an English-optimized embedding model ('all-MiniLM-L6-v2'). 
For multilingual datasets, you must explicitly specify the language.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade your Python environment to Python 3.9 or higher. Python 3.10 and 3.11 are explicitly supported, with 3.13 support added in recent patches.","message":"Support for Python 3.8 was officially dropped with recent BERTopic versions, notably around 0.17.x releases.","severity":"deprecated","affected_versions":">=0.17.0"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}