RAGAS: Evaluation Framework for RAG Pipelines

Introduction

Retrieval-Augmented Generation (RAG) has become the standard architecture for building knowledge-grounded LLM applications. However, building a RAG pipeline alone is not enough. We need a systematic way to measure the quality of the pipeline's output. This is where RAGAS (Retrieval Augmented Generation Assessment) comes in.

RAGAS is an open-source framework specifically designed for automatically evaluating RAG pipelines. It provides a set of metrics that measure various quality aspects, from the relevance of retrieved context to the faithfulness of answers to source documents.

In this tutorial, we will learn how to install and use RAGAS, understand its core metrics, create evaluation datasets, integrate it with LangChain, and apply it to a practical example of evaluating and improving a document QA system.

Installation

Install RAGAS and required dependencies using pip:

pip install ragas langchain langchain-openai chromadb

Make sure you have your OpenAI API key configured as an environment variable:

export OPENAI_API_KEY="sk-your-api-key-here"

RAGAS uses an LLM (OpenAI by default) to compute some of its metrics, so this API key is required for evaluation.
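Before running an evaluation, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch; the helper name `has_openai_key` is hypothetical and not part of RAGAS. It only checks that the `OPENAI_API_KEY` environment variable is set to a non-empty value.

```python
import os


def has_openai_key(env=os.environ):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` is any mapping; it defaults to the process environment.
    This is a hypothetical helper, not a RAGAS function.
    """
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if not has_openai_key():
        print("OPENAI_API_KEY is not set; export it before running the evaluation.")
```

The check does not validate the key against the OpenAI API; it only catches the common case of a missing or empty variable.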

Core RAGAS Metrics

RAGAS provides four main metrics, each measuring a different aspect of the RAG pipeline:

1. Faithfulness

Faithfulness measures how consistent the generated answer is with the provided context. This metric ensures that the LLM does not "hallucinate" or add information that is not present in the source documents.

from ragas.metrics import faithfulness

# Score 1.0 = all claims in the answer are supported by the context
# Score 0.0 = no claims are supported by the context

Faithfulness is calculated by:

Breaking down the answer into individual claims
Verifying each claim against the provided context
Computing the ratio of supported claims to total claims
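The final step, the ratio of supported claims to total claims, can be sketched in plain Python. This is an illustration only: in RAGAS, an LLM performs the claim extraction and verification, whereas here the verdicts are hand-labelled. The function name `faithfulness_score` and the choice to return 0.0 for an answer with no claims are assumptions.

```python
def faithfulness_score(verdicts):
    """Ratio of supported claims to total claims.

    verdicts: list of booleans, one per claim extracted from the answer;
    True means the claim is supported by the retrieved context.
    Assumption: an answer with no claims scores 0.0.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hand-labelled example: four claims, three supported by the context.
example_verdicts = [True, True, False, True]
print(faithfulness_score(example_verdicts))  # 0.75
```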

2. Answer Relevancy

Answer Relevancy measures how relevant the generated answer is to the question asked. Answers that are complete and directly address the question receive higher scores.

from ragas.metrics import answer_relevancy

# High score = answer is highly relevant to the question
# Low score = answer is irrelevant or too generic
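As the RAGAS documentation describes it, this metric asks an LLM to generate questions from the answer and then compares their embeddings with the embedding of the original question using cosine similarity. The sketch below shows only the averaging step, with tiny hand-made vectors standing in for real embeddings. The function names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question embedding
    and the embeddings of questions generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-d "embeddings": one generated question matches, one is orthogonal.
print(answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```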

3. Context Precision

Context Precision measures how well the retriever ranks its results. This metric evaluates whether relevant document chunks are ranked higher than irrelevant ones.

from ragas.metrics import context_precision

# High score = relevant context appears at the top
# Low score = relevant context is ranked below or mixed with irrelevant chunks
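The ranking-sensitive behavior can be illustrated with the average-precision style formula given in the RAGAS documentation: precision@k is summed over the ranks that hold a relevant chunk, then divided by the number of relevant chunks. In the sketch below, the relevance flags are hand-labelled (RAGAS uses an LLM to judge them), and the function name is hypothetical.

```python
def context_precision_at_k(relevance):
    """Average of precision@k over the ranks that hold a relevant chunk.

    relevance: list of 0/1 flags in retrieval order (index 0 = top result).
    Returns 0.0 when no chunk is relevant.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / total_relevant


# Same two relevant chunks, different ranking.
print(round(context_precision_at_k([1, 0, 1]), 3))  # 0.833
print(round(context_precision_at_k([0, 1, 1]), 3))  # 0.583
```

Both rankings contain the same two relevant chunks, but the one that places a relevant chunk first scores higher.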

4. Context Recall

Context Recall measures how much of the information needed to answer the question was successfully retrieved from the documents. This metric requires ground truth (reference answers) for computation.

from ragas.metrics import context_recall

# High score = all necessary information was retrieved
# Low score = important information was missed
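To make the role of the ground truth concrete, here is a small sketch that treats the reference answer as a set of statements and reports which ones were not found in the retrieved context. The attribution flags are hand-labelled here (RAGAS delegates that judgment to an LLM), and `context_recall_report` is a hypothetical helper, not a RAGAS function.

```python
def context_recall_report(attributions):
    """Return (score, missed_statements).

    attributions: dict mapping each ground-truth statement to True if it
    can be attributed to the retrieved context, False otherwise.
    Assumption: an empty ground truth scores 0.0.
    """
    missed = [stmt for stmt, found in attributions.items() if not found]
    total = len(attributions)
    score = (total - len(missed)) / total if total else 0.0
    return score, missed


# Hand-labelled example: the retriever missed one of two needed facts.
score, missed = context_recall_report({
    "Supervised learning uses labeled data for training.": True,
    "Unsupervised learning finds patterns in data without labels.": False,
})
print(score)   # 0.5
print(missed)  # the statement the retriever failed to cover
```

Listing the missed statements alongside the score makes it easier to see which documents the retriever should have returned.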

Creating Evaluation Datasets

To run a RAGAS evaluation, you need to prepare a dataset in the correct format. The dataset must contain: questions, retrieved contexts, generated answers, and ground truth (optional, but required for context recall).
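Before building the dataset, a quick shape check can catch mismatched columns early. The sketch below is plain Python with no RAGAS dependency. The column names `question`, `contexts`, `answer`, and `ground_truth` are an assumption: RAGAS has used different column names across versions, so check the documentation for your installed release. The answer string is a made-up placeholder.

```python
REQUIRED_COLUMNS = ("question", "contexts", "answer")  # assumed names
OPTIONAL_COLUMNS = ("ground_truth",)                   # needed for context recall


def validate_columns(data):
    """Check a column-oriented dict and return the number of rows.

    Raises ValueError if a required column is missing, if columns have
    different lengths, or if any contexts entry is not a list of strings.
    """
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    for ctx in data["contexts"]:
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError("each contexts entry must be a list of strings")
    return len(data["question"])


example = {
    "question": ["What is machine learning?"],
    # One list of retrieved chunks per question.
    "contexts": [["Machine learning is a subset of artificial intelligence "
                  "that focuses on developing algorithms that can learn from data."]],
    "answer": ["(placeholder for the pipeline's generated answer)"],
    "ground_truth": ["Machine learning is a branch of artificial intelligence "
                     "that enables systems to learn from data without being "
                     "explicitly programmed."],
}
print(validate_columns(example))  # 1
```

Note that each entry in `contexts` is itself a list, because a retriever usually returns several chunks per question.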

from datasets import Dataset

questions = [
    "What is machine learning?",
    "How do neural networks work?",
    "What is the difference between supervised and unsupervised learning?"
]

ground_truths = [
    "Machine learning is a branch of artificial intelligence that enables systems to learn from data without being explicitly programmed.",
    "Neural networks work by mimicking how neurons in the human brain function, using layers of interconnected nodes to process information.",
    "Supervised learning uses labeled data for training, while unsupervised learning finds patterns in data without labels."
]

contexts = [
    ["Machine learning is a subset of artificial intelligence that focuses on developing algorithms that can learn from data. ML systems can automatically improve their performance through experience."],
