BERTopic Tutorial: Modern Topic Modeling with Embeddings


By Ruby Abdullah · tutorial
BERTopic · Topic Modeling · NLP · Clustering · Embeddings · Python


BERTopic is a topic modeling library that combines transformer embeddings, dimensionality reduction, density-based clustering, and a class-based TF-IDF representation to discover themes in text collections. Instead of treating documents as bags of words like classical methods, it works on the semantic level, which usually produces more coherent and interpretable topics. This tutorial walks through the full BERTopic pipeline using a practical example: clustering a corpus of customer support tickets.

What Topic Modeling Is and Why BERTopic Helps

Topic modeling is the task of automatically grouping a collection of documents into themes ("topics") and describing each theme with a set of representative words. It is unsupervised: you do not provide labels, and the model finds structure on its own.

The traditional approach is Latent Dirichlet Allocation (LDA). LDA treats each document as a mixture of topics and each topic as a distribution over words. It works on raw word counts, so it ignores word order and meaning. In practice this leads to a few recurring problems:

  • It struggles with short texts (tweets, tickets, reviews) where word co-occurrence is sparse.
  • It requires you to fix the number of topics up front, which is hard to guess.
  • The topics it produces often mix unrelated words because it has no notion of semantic similarity.
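
To make the sparsity problem concrete, here is a minimal sketch that measures word overlap between two billing tickets taken from the example corpus used later in this tutorial. The stop list and the helper names (`content_words`, `jaccard`) are assumptions made for illustration; they are not part of LDA or BERTopic.

```python
import re

# Tiny illustrative stop list (an assumption of this sketch, not a standard list).
STOP_WORDS = {"i", "my", "a", "the", "for", "was", "did", "not", "this"}


def content_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def jaccard(a, b):
    """Share of distinct content words that two texts have in common."""
    words_a, words_b = content_words(a), content_words(b)
    return len(words_a & words_b) / len(words_a | words_b)


ticket_a = "My invoice shows a charge I did not authorize this month"
ticket_b = "I was billed twice for the same subscription, please refund"

overlap = jaccard(ticket_a, ticket_b)
print(overlap)
```

Both tickets are about billing, yet they share no content words, so a model that only sees word counts has nothing that links them.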

BERTopic takes a different route. It represents each document as a dense embedding from a language model, so documents that mean similar things sit close together in vector space even when they share no exact words. It then clusters those embeddings and, only at the end, derives keywords per cluster. The key insight is that clustering and keyword extraction are separated, which makes the whole process modular and easy to customize.

The default pipeline has five stages:

  • Embed documents with a sentence-transformer model.
  • Reduce the embedding dimensionality with UMAP.
  • Cluster the reduced embeddings with HDBSCAN.
  • Tokenize the documents per cluster with a vectorizer (CountVectorizer).
  • Weight the tokens with class-based TF-IDF (c-TF-IDF) to find topic keywords.

Each stage is a swappable component, which is the main reason BERTopic is flexible.

Installation

    pip install bertopic
    

BERTopic pulls in sentence-transformers, UMAP, HDBSCAN, and scikit-learn automatically. Optional features need additional packages, which you can install directly:

    # Saving models in the safetensors format
    pip install safetensors

    # Static document maps with datamapplot (plotly is already included)
    pip install datamapplot

    # LLM-based topic labels via the OpenAI backend
    pip install openai

If you have a GPU, install a CUDA build of PyTorch first so the embedding step runs faster.
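
One way to confirm that PyTorch actually sees the GPU is sketched below. `pick_device` is a hypothetical helper name used only for this example; it falls back to the CPU when PyTorch is missing or was installed without CUDA support.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU path is available.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embeddings will run on: {device}")
```

The returned string can be passed as the `device` argument when constructing a `SentenceTransformer` model.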

A First Model

Let's build a model on a small set of support tickets. In a real project you would load thousands of documents; the API is identical.

    from bertopic import BERTopic

    # Sample tickets. With the default min_topic_size of 10,
    # these eight alone are too few to form a topic.
    docs = [
        "My invoice shows a charge I did not authorize this month",
        "I was billed twice for the same subscription, please refund",
        "The app crashes every time I open the reports page",
        "Reports tab freezes and then the application closes",
        "How do I reset my password? The reset email never arrives",
        "I cannot log in, the password reset link is broken",
        "Can you explain the difference between the Pro and Team plans?",
        "What features are included in the Team plan pricing?",
        # ... thousands more in practice
    ]

    # Fit the model and assign a topic to every document in one call
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

fit_transform returns two values. topics holds the topic id assigned to each document, where -1 marks outliers that were not assigned to any topic, and probs holds the probability that each document belongs to its assigned topic.
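
Because the topic ids line up with the input documents, collecting the documents of each topic takes only a few lines. The sketch below uses hypothetical ids in place of a real `fit_transform` result, and `group_by_topic` is a helper name introduced here for illustration.

```python
from collections import defaultdict


def group_by_topic(docs, topics):
    """Map each topic id to the list of documents assigned to it."""
    grouped = defaultdict(list)
    for doc, topic_id in zip(docs, topics):
        grouped[topic_id].append(doc)
    return dict(grouped)


# Hypothetical assignments for illustration; real ids come from fit_transform.
example_docs = [
    "I was billed twice for the same subscription, please refund",
    "My invoice shows a charge I did not authorize this month",
    "The app crashes every time I open the reports page",
    "Thanks, that is all for now",
]
example_topics = [0, 0, 1, -1]

grouped = group_by_topic(example_docs, example_topics)
outliers = grouped.pop(-1, [])  # topic -1 collects the unassigned documents

for topic_id, members in sorted(grouped.items()):
    print(topic_id, len(members))
```

Handling topic -1 separately keeps outliers from being mistaken for a real theme.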

Inspecting Topics

get_topic_info() returns a DataFrame summarizing every topic: its id, how many documents it contains, a short auto-generated name, and the top keywords.

    info = topic_model.get_topic_info()
    print(info[["Topic", "Count", "Name"]])

To see the keywords and their c-TF-IDF weights for one topic, use get_topic:

    print(topic_model.get_topic(0))
    # [('refund', 0.041), ('billed', 0.038), ('invoice', 0.033), ...]
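
To show where such weights come from, here is a simplified pure-Python sketch of c-TF-IDF. It assumes whitespace tokenization and the weighting tf × log(1 + A / f), where tf is a word's share of its topic's words, A is the average number of words per topic, and f is the word's count across all topics. The function name, the toy documents, and the topic ids are made up for illustration; the library itself works on a CountVectorizer matrix and may differ in details.

```python
import math
from collections import Counter, defaultdict


def c_tf_idf(docs, topics, top_n=3):
    """Return the top_n (word, weight) pairs for every topic id."""
    # Merge all documents of a topic into one bag of words.
    counts_per_topic = defaultdict(Counter)
    for doc, topic_id in zip(docs, topics):
        counts_per_topic[topic_id].update(doc.lower().split())

    # Corpus-wide word counts and the average number of words per topic.
    total_per_word = Counter()
    for counts in counts_per_topic.values():
        total_per_word.update(counts)
    avg_words = sum(total_per_word.values()) / len(counts_per_topic)

    keywords = {}
    for topic_id, counts in counts_per_topic.items():
        n_words = sum(counts.values())
        weights = {
            word: (count / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, count in counts.items()
        }
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        keywords[topic_id] = ranked[:top_n]
    return keywords


# Toy documents and hypothetical topic ids, for illustration only.
toy_docs = [
    "the refund was billed twice",
    "refund the invoice charge",
    "the app crashes on the reports page",
    "the reports page freezes",
]
toy_topics = [0, 0, 1, 1]

keywords = c_tf_idf(toy_docs, toy_topics)
for topic_id, words in keywords.items():
    print(topic_id, [word for word, _ in words])
```

In topic 1, "the" is the most frequent word, but because it also appears in topic 0 it is weighted below "reports" and "page". This is the effect that lets c-TF-IDF surface words that are distinctive for a cluster rather than merely common.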
