BERTopic: Modern Topic Modeling with Embeddings
BERTopic is a topic modeling library that combines transformer embeddings, dimensionality reduction, density-based clustering, and a class-based TF-IDF representation to discover themes in text collections. Instead of treating documents as bags of words, as classical methods do, it works with semantic representations of the text, which usually produces more coherent and interpretable topics. This tutorial walks through the full BERTopic pipeline using a practical example: clustering a corpus of customer support tickets.
What Topic Modeling Is and Why BERTopic Helps
Topic modeling is the task of automatically grouping a collection of documents into themes ("topics") and describing each theme with a set of representative words. It is unsupervised: you do not provide labels, and the model finds structure on its own.
The traditional approach is Latent Dirichlet Allocation (LDA). LDA treats each document as a mixture of topics and each topic as a distribution over words. It works on raw word counts, so it ignores word order and meaning. In practice this leads to a few recurring problems:
- It struggles with short texts (tweets, tickets, reviews) where word co-occurrence is sparse.
- It requires you to fix the number of topics up front, which is hard to guess.
- The topics it produces often mix unrelated words because it has no notion of semantic similarity.
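To make the bag-of-words limitation concrete, here is a minimal sketch using only the Python standard library. It is not LDA itself, only the word-count representation LDA starts from; the helper name bag_of_words and the example sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (hypothetical helper)."""
    return Counter(text.lower().split())


# Word order is lost: both sentences reduce to identical counts.
first = bag_of_words("the app closed the report")
second = bag_of_words("the report closed the app")
same_counts = first == second
print(same_counts)

# Short texts about the same problem may share no words at all,
# so count-based methods see nothing connecting them.
ticket_a = bag_of_words("billed twice for my subscription")
ticket_b = bag_of_words("unauthorized charge on invoice")
shared_words = set(ticket_a) & set(ticket_b)
print(shared_words)
```

Both billing tickets describe the same kind of problem, yet their word sets do not overlap, which is the sparsity issue described in the first bullet.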
BERTopic takes a different route. It represents each document as a dense embedding from a language model, so documents that mean similar things sit close together in vector space even when they share no exact words. It then clusters those embeddings and, only at the end, derives keywords per cluster. The key insight is that clustering and keyword extraction are separated, which makes the whole process modular and easy to customize.
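The idea that similar documents sit close together can be illustrated with cosine similarity, a common way to compare embeddings. The sketch below uses hand-written three-dimensional vectors as stand-ins; they are hypothetical values chosen for illustration, not the output of any real model, which would typically produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" for three tickets.
billed_twice = [0.9, 0.1, 0.0]
unauthorized_charge = [0.8, 0.2, 0.1]
app_crashes = [0.1, 0.0, 0.9]

billing_pair = cosine_similarity(billed_twice, unauthorized_charge)
mixed_pair = cosine_similarity(billed_twice, app_crashes)

# The two billing tickets score much higher than the billing/crash pair.
print(round(billing_pair, 2), round(mixed_pair, 2))
```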
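The default pipeline has five stages:
1. Embed each document with a sentence-transformers model.
2. Reduce the dimensionality of the embeddings with UMAP.
3. Cluster the reduced embeddings with HDBSCAN.
4. Tokenize the documents of each cluster with a CountVectorizer.
5. Weight the tokens with class-based TF-IDF (c-TF-IDF) to obtain the keywords for each topic.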
Each stage is a swappable component, which is the main reason BERTopic is flexible.
Installation
pip install bertopic
BERTopic pulls in sentence-transformers, UMAP, HDBSCAN, and scikit-learn automatically. For optional features install the extras you need:
# Saving models in the safetensors format
pip install bertopic[safetensors]
# Visualization support (plotly is included, but datamapplot is extra)
pip install bertopic[visualization]
# LLM-based topic labels via the OpenAI backend
pip install bertopic[openai]
If you have a GPU, install a CUDA build of PyTorch first so the embedding step runs faster.
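One quick way to check whether PyTorch can see a GPU is sketched below. The helper name cuda_available is made up for this example, and the fallback for a missing PyTorch installation is an assumption added so the snippet runs anywhere.

```python
def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU support to report.
        return False
    return torch.cuda.is_available()


has_gpu = cuda_available()
print("CUDA available:", has_gpu)
```

If this prints False on a machine with a GPU, the installed PyTorch build is probably CPU-only.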
A First Model
Let's build a model on a small set of support tickets. In a real project you would load thousands of documents; the API is identical.
from bertopic import BERTopic
docs = [
"My invoice shows a charge I did not authorize this month",
"I was billed twice for the same subscription, please refund",
"The app crashes every time I open the reports page",
"Reports tab freezes and then the application closes",
"How do I reset my password? The reset email never arrives",
"I cannot log in, the password reset link is broken",
"Can you explain the difference between the Pro and Team plans?",
"What features are included in the Team plan pricing?",
# ... thousands more in practice
]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
fit_transform returns two values. topics is a list holding the topic id assigned to each document, where -1 marks outliers that were not assigned to any topic. probs holds the probability that each document belongs to its assigned topic.
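A quick sanity check after fitting is to count how many documents landed in each topic. The sketch below uses a hand-written topics list as a stand-in for real output; the ids are hypothetical and will differ for a fitted model. With the default HDBSCAN clustering, documents that fit no cluster receive the id -1.

```python
from collections import Counter

# Hypothetical output of fit_transform for eight documents.
topics = [0, 0, 1, 1, 2, 2, -1, 3]

topic_sizes = Counter(topics)

# Separate the outlier bucket (-1) from the real topics.
n_outliers = topic_sizes.pop(-1, 0)

print("documents per topic:", dict(topic_sizes))
print("outliers:", n_outliers)
```

A large share of outliers usually suggests that the clustering settings are too strict for the corpus.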
Inspecting Topics
get_topic_info() returns a DataFrame summarizing every topic: its id, how many documents it contains, a short auto-generated name, and the top keywords.
info = topic_model.get_topic_info()
print(info[["Topic", "Count", "Name"]])
To see the keywords and their c-TF-IDF weights for one topic, use get_topic:
print(topic_model.get_topic(0))
[('refund', 0.041), ('billed', 0.038), ('invoice', 0.033), ...]
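To give a feel for where such weights come from, here is a minimal pure-Python sketch of class-based TF-IDF. It assumes the commonly cited formulation: the term frequency of a word within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the word's frequency across all topics. The function name c_tf_idf and the toy documents are made up for illustration, and the library's own implementation differs in details such as tokenization and sparse matrices.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic as tf(word, topic) * log(1 + A / f(word))."""
    # Treat all documents of a topic as one long document and count its words.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(word): how often the word occurs across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (n / n_words) * math.log(1 + avg_words / total[word])
            for word, n in topic_counts.items()
        }
    return scores


# Toy clusters (hypothetical): documents already grouped by topic id.
docs_per_topic = {
    0: ["billed twice please refund", "refund the invoice charge"],
    1: ["app crashes on reports page", "reports tab freezes"],
}

scores = c_tf_idf(docs_per_topic)
top_words = {
    topic: sorted(words, key=words.get, reverse=True)[:3]
    for topic, words in scores.items()
}
print(top_words)
```

In this toy data, "refund" and "reports" rank first for their topics because they are frequent within one topic and absent from the other.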