ColBERT & RAGatouille: Late-Interaction Retrieval for Better RAG
Most RAG systems lean on single-vector dense embeddings, where an entire passage is compressed into one vector. That compression is convenient and fast, but it throws away token-level detail that often decides whether a retrieval is actually relevant. This tutorial explains how ColBERT's late-interaction approach keeps that detail, and how RAGatouille lets you adopt it without rewriting your stack.
The Problem with Single-Vector Dense Retrieval
A standard bi-encoder (think Sentence-Transformers feeding FAISS or pgvector) encodes a query into one vector and each document into one vector, then ranks by cosine similarity. This is what the semantic-search, FAISS, and Sentence-Transformers tutorials cover, and it works well for many cases.
The weakness is the information bottleneck. A 512-token passage about, say, "warranty claims for diesel engines in mining trucks" gets pooled into a single vector (768 dimensions for a typical BERT-base encoder). Specific terms and their relationships are smeared together. When a query asks about one precise detail ("what is the warranty period for the turbocharger?"), the single document vector may not surface it, because the turbocharger detail was diluted by everything else in the passage.
Three failure modes show up repeatedly:
- Lexical precision is lost. Rare but critical terms (part numbers, drug names, statute references) get averaged away.
- Long passages degrade. The longer the chunk, the more the single vector becomes a blurry summary.
- Out-of-domain queries suffer. A generic bi-encoder has no token-level signal to fall back on.
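The dilution effect can be sketched with random vectors standing in for real token embeddings. This is a toy illustration, not a measurement of any real model: the helper `unit`, the 200-token passage, and the choice of token 17 as the "turbocharger" token are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical passage: 200 random unit vectors, one per token.
# Token 17 plays the role of the single "turbocharger" mention.
doc_tokens = unit(rng.normal(size=(200, dim)))

# A query that matches that one token exactly.
query = doc_tokens[17].copy()

# Single-vector view: mean-pool all tokens, then normalize.
pooled = unit(doc_tokens.mean(axis=0))
pooled_sim = float(query @ pooled)

# Token-level view: keep the best match over all document tokens.
token_sim = float((doc_tokens @ query).max())

print(f"pooled similarity:      {pooled_sim:.3f}")
print(f"best token similarity:  {token_sim:.3f}")
```

In this toy setup the token-level match is essentially perfect, while the pooled similarity is far lower because 199 unrelated tokens share the same vector.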
Three Retrieval Architectures
It helps to place ColBERT against the two architectures you already know.
Bi-encoder (single-vector dense)
Query and document are encoded independently into one vector each. Similarity is a single dot product (equivalent to cosine similarity when the vectors are L2-normalized).
- Quality: good, ceiling limited by the bottleneck.
- Latency: very low at query time (one vector, ANN search).
- Storage: small (one vector per document).
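As a minimal sketch of single-vector ranking, the snippet below scores every document with one dot product against the query. The function name `rank_single_vector` and the tiny 3-dimensional vectors are made up for illustration; a real system would use an encoder model and an ANN index rather than a brute-force matrix product.

```python
import numpy as np

def rank_single_vector(query_vec, doc_matrix, top_k=3):
    """Rank documents by cosine similarity to a single query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # one dot product per document
    order = np.argsort(-scores)[:top_k]  # highest score first
    return order, scores[order]

# Three toy "documents", one vector each.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([0.9, 0.1, 0.0])

order, scores = rank_single_vector(query, docs, top_k=2)
print(order)  # document indices, best first
```

Everything the model knows about a document has to fit in its one row of `docs`, which is the bottleneck described earlier.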
Cross-encoder
Query and document are concatenated and fed through the model together, producing a single relevance score. The model can attend across every query-document token pair.
- Quality: highest.
- Latency: very high. You must run the model once per candidate document, so scanning a whole corpus is impractical; in practice it only reranks a short candidate list.
- Storage: nothing precomputed (and that is the problem at scale).
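The per-candidate cost can be made concrete with a small sketch. `CountingScorer` is a hypothetical stand-in for a cross-encoder forward pass (it only measures word overlap), and `rerank` is an illustrative helper, not any library's API. The point is the call count: one scoring call per candidate, every time a query arrives.

```python
from typing import Callable, List, Tuple

class CountingScorer:
    """Toy stand-in for a cross-encoder: word overlap, plus a call counter."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, query: str, doc: str) -> float:
        self.calls += 1  # each call represents one full model forward pass
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        return len(q_words & d_words) / len(q_words)

def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly, then sort by score."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = [
    "turbocharger warranty period is two years",
    "diesel engine maintenance schedule",
    "warranty claims for mining trucks",
]
scorer = CountingScorer()
ranked = rerank("turbocharger warranty period", candidates, scorer)

print(ranked[0][0])
print("scoring calls:", scorer.calls)
```

With three candidates this is three calls; with a million-document corpus it would be a million model runs per query, which is why cross-encoders are used after a cheaper first-stage retriever.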
Late interaction (ColBERT)
ColBERT keeps one embedding per token for both the query and the document. At scoring time, each query-token embedding is compared against every document-token embedding; the best match per query token is kept (MaxSim), and these per-token maxima are summed.
- Quality: typically close to a cross-encoder and above a single-vector bi-encoder.
- Latency: moderate. Document token embeddings are precomputed and indexed; only query encoding and the MaxSim aggregation happen at query time.
- Storage: larger (many vectors per document), which is the main cost.
The key insight: ColBERT defers the query-document interaction until after both are encoded ("late"), so documents can still be precomputed and indexed, yet the comparison remains token-level.
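To get a feel for the storage cost listed above, here is a back-of-the-envelope estimate. All numbers are assumptions chosen for illustration: one million passages, a 768-dimensional single vector, 200 token vectors per passage at 128 dimensions each, and uncompressed float32 values. Real late-interaction engines usually compress token vectors, so actual index sizes can be much smaller.

```python
def index_size_gb(num_docs, vectors_per_doc, dim, bytes_per_value=4):
    """Raw (uncompressed) embedding storage in gigabytes."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

num_docs = 1_000_000  # assumed corpus size

# Single-vector bi-encoder: one 768-dim float32 vector per passage.
single_gb = index_size_gb(num_docs, vectors_per_doc=1, dim=768)

# Late interaction: assumed 200 token vectors of 128 dims per passage.
multi_gb = index_size_gb(num_docs, vectors_per_doc=200, dim=128)

print(f"single-vector index: {single_gb:.3f} GB")    # 3.072 GB
print(f"multi-vector index:  {multi_gb:.1f} GB")      # 102.4 GB
print(f"ratio:               {multi_gb / single_gb:.1f}x")
```

Under these assumptions the multi-vector index is roughly 33 times larger before compression, which is the trade made in exchange for token-level matching.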
How MaxSim Scoring Works
Given a query with token embeddings $q_1, \dots, q_n$ and a document with token embeddings $d_1, \dots, d_m$, the ColBERT score is:
$$\mathrm{score}(Q, D) = \sum_{i=1}^{n} \max_{1 \le j \le m} \, q_i \cdot d_j$$
For each query token, find the single document token it matches best (the max), then add up those best matches across all query tokens. A query token for "turbocharger" can latch onto the one document token that mentions turbochargers, even if the rest of the passage is about something else. That is exactly the signal a single averaged vector loses.
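A small worked example may help. Suppose a query has two tokens and a document has three, and write $s_{ij}$ for the similarity between query token $i$ and document token $j$. The numbers below are invented purely for illustration.

```latex
S =
\begin{pmatrix}
0.9 & 0.2 & 0.1 \\
0.3 & 0.1 & 0.8
\end{pmatrix},
\qquad
\max_j s_{1j} = 0.9, \quad
\max_j s_{2j} = 0.8,
\qquad
\mathrm{score}(Q, D) = 0.9 + 0.8 = 1.7
```

The first query token is matched by document token 1 and the second by document token 3; the weak entries in each row do not affect the score at all.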
A small illustration of the operation in NumPy (conceptual, not the real engine):
import numpy as np
def maxsim(querytokenemb, doctokenemb):
    # querytokenemb: (nq, dim), doctokenemb: (nd, dim)
    # assume L2-normalized embeddings, so dot product == cosine
    sim = querytokenemb @ doctokenemb.T  # (nq, nd)
    perquerybest = sim.max(axis=1)  # best doc token per query token