ColBERT & RAGatouille Tutorial: Late-Interaction Retrieval for RAG


By Ruby Abdullah · tutorial
ColBERT · RAGatouille · Retrieval · RAG · Late Interaction · Python

ColBERT & RAGatouille: Late-Interaction Retrieval for Better RAG

Most RAG systems lean on single-vector dense embeddings, where an entire passage is compressed into one vector. That compression is convenient and fast, but it throws away token-level detail that often decides whether a retrieval is actually relevant. This tutorial explains how ColBERT's late-interaction approach keeps that detail, and how RAGatouille lets you adopt it without rewriting your stack.

The Problem with Single-Vector Dense Retrieval

A standard bi-encoder (think Sentence-Transformers feeding FAISS or pgvector) encodes a query into one vector and each document into one vector, then ranks by cosine similarity. This is what the semantic-search, FAISS, and Sentence-Transformers tutorials cover, and it works well for many cases.

The weakness is the information bottleneck. A 512-token passage about, say, "warranty claims for diesel engines in mining trucks" gets pooled into a single 768-dimensional vector, typically by averaging its token representations. Specific terms and their relationships are smeared together. When a query asks about one precise detail ("what is the warranty period for the turbocharger?"), the single document vector may not surface the passage, because the turbocharger detail was diluted by everything else in it.

Three failure modes show up repeatedly:

  • Lexical precision is lost. Rare but critical terms (part numbers, drug names, statute references) get averaged away.
  • Long passages degrade. The longer the chunk, the more the single vector becomes a blurry summary.
  • Out-of-domain queries suffer. A generic bi-encoder has no token-level signal to fall back on.
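The dilution effect can be seen in a toy example. The sketch below uses hand-made 4-dimensional vectors (an assumption for illustration, not output from any real model) and mean pooling as the single-vector strategy. The relevant passage contains one exact "turbocharger" token among nine off-topic tokens; the distractor contains ten vaguely related tokens.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def pooled_cosine(query_vec, doc_tokens):
    """Cosine between the query and the mean-pooled document vector."""
    return float(unit(query_vec) @ unit(doc_tokens.mean(axis=0)))

def best_token_cosine(query_vec, doc_tokens):
    """Cosine between the query and its best-matching document token.

    Assumes every row of doc_tokens already has unit length.
    """
    return float((doc_tokens @ unit(query_vec)).max())

# Hand-made, unit-length toy embeddings (hypothetical values).
query = np.array([0.0, 1.0, 0.0, 0.0])               # "turbocharger"
relevant = np.array([[1.0, 0.0, 0.0, 0.0]] * 9       # nine off-topic tokens
                    + [[0.0, 1.0, 0.0, 0.0]])        # one exact match
distractor = np.array([[0.8, 0.6, 0.0, 0.0]] * 10)   # ten vaguely related tokens

pooled_relevant = pooled_cosine(query, relevant)
pooled_distractor = pooled_cosine(query, distractor)
token_relevant = best_token_cosine(query, relevant)
token_distractor = best_token_cosine(query, distractor)

print(f"pooled: relevant={pooled_relevant:.2f} distractor={pooled_distractor:.2f}")
print(f"token:  relevant={token_relevant:.2f} distractor={token_distractor:.2f}")
```

In this contrived setup the pooled score ranks the distractor above the relevant passage, while the token-level score ranks them the other way round. Real embeddings are far less clean, but the mechanism is the same.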

Three Retrieval Architectures

It helps to place ColBERT against the two architectures you already know.

Bi-encoder (single-vector dense)

Query and document are encoded independently into one vector each. Similarity is a single dot product (equivalent to cosine similarity when the vectors are L2-normalized).

  • Quality: good, ceiling limited by the bottleneck.
  • Latency: very low at query time (one vector, ANN search).
  • Storage: small (one vector per document).
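As a rough sketch of what bi-encoder scoring amounts to, the snippet below ranks a few precomputed document vectors against one query vector. The vectors are made up, and the brute-force matrix product stands in for the ANN index a real system would use; `rank_by_dot_product` is a hypothetical helper, not part of any library.

```python
import numpy as np

def rank_by_dot_product(query_vec, doc_matrix, top_k=2):
    """Return (doc_index, score) pairs for the top_k highest dot products."""
    scores = doc_matrix @ query_vec          # one score per document
    order = np.argsort(-scores)[:top_k]      # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical precomputed document vectors, one row per document.
doc_matrix = np.array([
    [0.6, 0.8, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
])
query_vec = np.array([0.0, 1.0, 0.0])

top = rank_by_dot_product(query_vec, doc_matrix)
print(top)
```

Each document contributes exactly one number to the ranking, which is why this approach is cheap and also why it cannot recover detail that the pooling step already discarded.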

Cross-encoder

Query and document are concatenated and fed through the model together, producing a single relevance score. The model can attend across every query-document token pair.

  • Quality: highest.
  • Latency: very high. You must run the model once per candidate document, so scanning a whole corpus is impractical; in practice it only reranks a short list.
  • Storage: nothing precomputed (and that is the problem at scale).
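The per-candidate cost is easy to see in a reranking loop. In the sketch below, `toy_pair_scorer` is a hypothetical stand-in for a cross-encoder (it only counts shared words), and the call counter is there purely to show that the number of model runs equals the number of candidates.

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           pair_scorer: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and sort best first.

    One pair_scorer call per candidate, so cost grows linearly with the list.
    """
    scored = [(doc, pair_scorer(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

calls = 0  # counts how many times the scorer runs

def toy_pair_scorer(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words in doc."""
    global calls
    calls += 1
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

shortlist = [
    "engine oil change intervals",
    "turbocharger warranty period details",
    "warranty claims for mining trucks",
]
ranked = rerank("turbocharger warranty period", shortlist, toy_pair_scorer)
print(ranked[0][0], calls)
```

With a real cross-encoder each of those calls is a full forward pass over the concatenated query and document, which is why the shortlist has to come from a cheaper first-stage retriever.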

Late interaction (ColBERT)

ColBERT keeps one embedding per token for both the query and the document. At scoring time, every query-token embedding is compared against all document-token embeddings, the best match per query token is kept (MaxSim), and those best-match scores are summed into the document's score.

  • Quality: close to a cross-encoder, well above a bi-encoder.
  • Latency: moderate. Document token embeddings are precomputed and indexed; only the MaxSim aggregation happens at query time.
  • Storage: larger (many vectors per document), which is the main cost.
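A back-of-the-envelope calculation makes the storage trade-off concrete. Every number here is an assumption chosen for illustration: a corpus of one million passages, 768-dimensional float32 vectors for the single-vector index, and 200 token vectors of 128 dimensions per passage for the multi-vector index. The estimate covers raw embeddings only and ignores index overhead and any compression an implementation might apply.

```python
def index_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=4):
    """Raw embedding storage in bytes (no index overhead, no compression)."""
    return num_docs * vectors_per_doc * dim * bytes_per_value

NUM_DOCS = 1_000_000  # assumed corpus size

# Single-vector index: one 768-dim float32 vector per passage.
single = index_bytes(NUM_DOCS, vectors_per_doc=1, dim=768)

# Multi-vector index: assumed 200 tokens per passage, 128 dims per token.
multi = index_bytes(NUM_DOCS, vectors_per_doc=200, dim=128)

print(f"single-vector: {single / 1e9:.1f} GB")
print(f"multi-vector:  {multi / 1e9:.1f} GB ({multi / single:.1f}x)")
```

Under these assumptions the uncompressed multi-vector index is roughly 33 times larger, which is why storage, not query latency, tends to be the first constraint you hit.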

The key insight: ColBERT defers the query-document interaction until after both are encoded ("late"), so documents can still be precomputed and indexed, yet the comparison remains token-level.

How MaxSim Scoring Works

Given a query with token embeddings q_1 ... q_n and a document with token embeddings d_1 ... d_m, the ColBERT score is:

score(Q, D) = sum_{i=1..n} max_{j=1..m} ( q_i · d_j )

For each query token, find the single document token it matches best (the max), then add up those best matches across all query tokens. A query token for "turbocharger" can latch onto the one document token that mentions turbochargers, even if the rest of the passage is about something else. That is exactly the signal a single averaged vector loses.

A small illustration of the operation in NumPy (conceptual, not the real engine):

import numpy as np

def maxsim(query_token_emb, doc_token_emb):
    # query_token_emb: (nq, dim), doc_token_emb: (nd, dim)
    # assume L2-normalized embeddings, so dot product == cosine
    sim = query_token_emb @ doc_token_emb.T  # (nq, nd)
    per_query_best = sim.max(axis=1)  # best doc token per query token
    return per_query_best.sum()  # MaxSim score: sum of the per-token maxima
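To see the scoring rule rank documents, here is a self-contained sketch with hand-made 2-dimensional token embeddings (made-up values, not model output). `maxsim_score` is a hypothetical helper that applies the score formula given earlier: take the best dot product per query token, then sum.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product with any document token."""
    return float((query_tokens @ doc_tokens.T).max(axis=1).sum())

# Hand-made unit-length token embeddings (2-dim for readability).
query_tokens = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
docs = {
    # Contains an exact match for each query token, plus one extra token.
    "doc_a": np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]),
    # Contains only partial matches.
    "doc_b": np.array([[0.6, 0.8], [0.8, 0.6]]),
}

scores = {name: maxsim_score(query_tokens, emb) for name, emb in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Note that the two documents have different numbers of tokens. The max over document tokens absorbs that difference, so the score always has one term per query token.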
