Sentence Transformers: Embeddings, Semantic Similarity, and Rerankers
Sentence Transformers (often called SBERT) is a Python library for turning text into dense vector embeddings that capture meaning rather than surface words. In this tutorial we build a small retrieval system over a toy corpus, measure semantic similarity, add a cross-encoder reranker, and then fine-tune our own embedding model with the modern training API. The goal is a practical, end-to-end view of how the library fits into a real search or RAG pipeline.
What Sentence Transformers Is
The sentence-transformers library wraps transformer models (BERT, RoBERTa, MPNet, and many others) and adds pooling so that an entire sentence or paragraph maps to a single fixed-length vector. Two texts with similar meaning produce vectors that are close together, which lets you compare them with cosine similarity instead of keyword matching.
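To make "close together" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins chosen for illustration; a real model produces hundreds of dimensions, and the helper name cosine_similarity is our own rather than part of the library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not real model output).
password_reset = [0.9, 0.1, 0.0]
login_recovery = [0.8, 0.3, 0.1]
weather_report = [0.0, 0.2, 0.9]

related = cosine_similarity(password_reset, login_recovery)
unrelated = cosine_similarity(password_reset, weather_report)
print(f"related pair:   {related:.3f}")
print(f"unrelated pair: {unrelated:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0, regardless of whether the underlying texts share any words.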
The library is maintained alongside the Hugging Face ecosystem, so models load from the Hub, datasets use the datasets format, and trained models push back to the Hub with one call. It is the standard tool for building embedding-based search, clustering, deduplication, and the retrieval stage of RAG systems.
Bi-Encoders vs Cross-Encoders
There are two model families in the library, and choosing correctly is the single most important design decision.
A bi-encoder encodes each text independently into a vector. You embed your whole corpus once, store the vectors, and at query time you embed only the query and compare it against the stored vectors. This is fast and scales to millions of documents because comparison is just a dot product. The trade-off is accuracy: the model never sees the query and document together, so it can miss subtle interactions.
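The embed-once, compare-at-query-time flow can be sketched without any model at all. The snippet below is an illustration under stated assumptions: the document ids and vectors are made up, and normalize, dot, and search are hypothetical helpers, not library functions. Normalizing the stored vectors up front means the dot product at query time equals cosine similarity.

```python
import heapq

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline step: embed and normalize the corpus once, then store the vectors.
# These values are hypothetical stand-ins for real model output.
corpus_vectors = {
    "doc_password": normalize([0.9, 0.1, 0.0]),
    "doc_billing": normalize([0.2, 0.9, 0.1]),
    "doc_weather": normalize([0.0, 0.2, 0.9]),
}

def search(query_vector, index, top_k=2):
    """Query-time step: embed only the query, then score it against stored vectors."""
    q = normalize(query_vector)
    scored = ((dot(q, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(top_k, scored)  # list of (score, doc_id), best first

results = search([0.8, 0.3, 0.1], corpus_vectors)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

At scale the dictionary would typically be replaced by a matrix or a vector index, but the division of work stays the same: the expensive encoding happens once per document, and each query costs one encoding plus cheap vector comparisons.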
A cross-encoder takes a pair of texts at once (query and candidate) and outputs a single relevance score. Because the model attends across both texts jointly, it is usually considerably more accurate than a bi-encoder. The cost is that you cannot precompute anything: every query-document pair must be run through the model. Scoring one query against a million documents would take a million forward passes, which is not feasible at interactive latency.
The standard pattern combines both. The bi-encoder retrieves a few dozen candidates quickly, then the cross-encoder reranks just those candidates for precision. This is the retrieve-then-rerank pattern we build later.
Query --> [Bi-encoder] --> top 50 candidates --> [Cross-encoder] --> top 5 reranked
          (fast, approximate)                    (slow, precise)
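The control flow of the two-stage pattern can be sketched independently of any model. In the snippet below, retrieve_then_rerank, word_overlap, and overlap_density are hypothetical names, and the two scorers are deliberately crude stand-ins for a bi-encoder and a cross-encoder; only the shape of the pipeline is the point.

```python
def retrieve_then_rerank(query, corpus, fast_score, precise_score,
                         retrieve_k=50, final_k=5):
    """Two-stage search: cheap scoring over everything, precise scoring over a shortlist."""
    # Stage 1: score every document with the cheap scorer and keep the best retrieve_k.
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc),
                        reverse=True)[:retrieve_k]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc),
                      reverse=True)
    return reranked[:final_k]

# Toy stand-in for the bi-encoder: number of shared words.
def word_overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

# Toy stand-in for the cross-encoder: shared words relative to document length.
def overlap_density(query, doc):
    return word_overlap(query, doc) / len(doc.split())

corpus = [
    "reset your account password from the settings page",
    "password rules for new accounts",
    "today the weather is hot",
    "contact support to close your account",
]
query = "reset account password"

top = retrieve_then_rerank(query, corpus, word_overlap, overlap_density,
                           retrieve_k=3, final_k=2)
for doc in top:
    print(doc)
```

With retrieve_k=3 the weather sentence is dropped in stage 1, so the expensive scorer runs on three documents instead of four. On a real corpus the same structure keeps the number of cross-encoder calls fixed at retrieve_k no matter how large the corpus grows.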
Installation
Install the library with pip. It pulls in PyTorch and transformers as dependencies.
pip install -U sentence-transformers
For training and evaluation you will also want a few extras. Installing accelerate enables faster and multi-GPU training, and datasets is required for the training API. The base install does not necessarily include either one, so install them explicitly.
pip install -U accelerate datasets
Verify the install and check which device is available.
import torch
from sentence_transformers import SentenceTransformer
print("sentence-transformers ready")
print("CUDA available:", torch.cuda.is_available())
Loading a Model and Encoding Text
The core class is SentenceTransformer. Pass it a model name from the Hub and it downloads and caches the weights. A good general-purpose starting model is all-MiniLM-L6-v2: small, fast, 384-dimensional, and strong on English semantic similarity.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
"How do I reset my account password?",
"Steps to recover a forgotten login",
"The weather in Jakarta is hot today.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 384)