RAGAS: Evaluation Framework for RAG Pipelines
Introduction
Retrieval-Augmented Generation (RAG) has become the standard architecture for building knowledge-grounded LLM applications. However, building a RAG pipeline alone is not enough. We need a systematic way to measure the quality of the pipeline's output. This is where RAGAS (Retrieval Augmented Generation Assessment) comes in.
RAGAS is an open-source framework specifically designed for automatically evaluating RAG pipelines. It provides a set of metrics that measure various quality aspects, from the relevance of retrieved context to the faithfulness of answers to source documents.
In this tutorial, we will learn how to install and use RAGAS, understand its core metrics, create evaluation datasets, integrate it with LangChain, and apply it to a practical example of evaluating and improving a document QA system.
Installation
Install RAGAS and required dependencies using pip:
pip install ragas langchain langchain-openai chromadb
Make sure you have your OpenAI API key configured as an environment variable:
export OPENAI_API_KEY="sk-your-api-key-here"
RAGAS uses an LLM (OpenAI by default) to compute some of its metrics, so this API key is required for evaluation.
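Before running an evaluation, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch; `has_openai_key` is a hypothetical helper written for this tutorial, not part of RAGAS or the OpenAI client.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())
```

Calling `has_openai_key()` at the top of an evaluation script and stopping early when it returns `False` avoids a failure partway through a run.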
Core RAGAS Metrics
RAGAS provides four main metrics, each measuring a different aspect of the RAG pipeline:
1. Faithfulness
Faithfulness measures how consistent the generated answer is with the provided context. This metric ensures that the LLM does not "hallucinate" or add information that is not present in the source documents.
from ragas.metrics import faithfulness
# Score 1.0 = all claims in the answer are supported by the context
# Score 0.0 = no claims are supported by the context
Faithfulness is calculated by:
- Breaking down the answer into individual claims
- Verifying each claim against the provided context
- Computing the ratio of supported claims to total claims
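The last step above is plain arithmetic and can be sketched as follows. This assumes the claim extraction and verification have already happened (in RAGAS both are done by an LLM); `faithfulness_score` and the sample verdicts are illustrative names and data, not the library's API.

```python
def faithfulness_score(claim_supported):
    """Ratio of supported claims to total claims.

    claim_supported: list of booleans, one per claim extracted from the
    answer. True means the claim was verified against the context.
    """
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer that was split into four claims,
# three of which are supported by the retrieved context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```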
2. Answer Relevancy
Answer Relevancy measures how relevant the generated answer is to the question asked. Answers that are complete and directly address the question receive higher scores.
from ragas.metrics import answer_relevancy
# High score = answer is highly relevant to the question
# Low score = answer is irrelevant or too generic
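One way to picture this score, as we understand the RAGAS approach: an LLM generates several questions from the answer, and the score is the mean cosine similarity between the embedding of the original question and the embeddings of those generated questions. The sketch below shows only the similarity arithmetic on toy vectors. The function names and vectors are assumptions for illustration; real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the original question and the questions
    generated back from the answer (all given as embedding vectors)."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings": the first generated question points the same way
# as the original question, the second is orthogonal to it.
original = [1.0, 0.0]
generated = [[2.0, 0.0], [0.0, 3.0]]
print(answer_relevancy_score(original, generated))  # 0.5
```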
3. Context Precision
Context Precision measures how well the retriever ranks its results: it evaluates whether the document chunks relevant to the question appear above the irrelevant ones in the retrieved list.
from ragas.metrics import context_precision
# High score = relevant context appears at the top
# Low score = relevant context is ranked below or mixed in with irrelevant chunks
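To make the ranking idea concrete, here is a sketch of a rank-aware precision score: precision@k is computed at every rank that holds a relevant chunk, and the values are averaged. This is a common formulation that we assume approximates what RAGAS does. It is not the library's implementation, and in RAGAS the per-chunk relevance flags are judged by an LLM rather than supplied by hand.

```python
def rank_aware_precision(relevance):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 0/1 flags in retrieval order (index 0 = top rank).
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank
    return precision_sum / hits if hits else 0.0


# The same two relevant chunks, ranked first versus ranked last.
print(rank_aware_precision([1, 1, 0, 0]))  # 1.0
print(rank_aware_precision([0, 0, 1, 1]))  # lower, since both hits are at the bottom
```

The two calls use the same set of chunks, so a plain (unranked) precision would be identical for both; only the ordering changes the score.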
4. Context Recall
Context Recall measures how much of the information needed to answer the question was successfully retrieved from the documents. This metric requires ground truth (reference answers) for computation.
from ragas.metrics import context_recall
# High score = all necessary information was retrieved
# Low score = important information was missed
Creating Evaluation Datasets
To run a RAGAS evaluation, you need to prepare a dataset in the correct format. The dataset must contain: questions, retrieved contexts, generated answers, and ground truth (optional, but required for context recall).
from datasets import Dataset
questions = [
    "What is machine learning?",
    "How do neural networks work?",
    "What is the difference between supervised and unsupervised learning?"
]
groundtruths = [
    "Machine learning is a branch of artificial intelligence that enables systems to learn from data without being explicitly programmed.",
    "Neural networks work by mimicking how neurons in the human brain function, using layers of interconnected nodes to process information.",
    "Supervised learning uses labeled data for training, while unsupervised learning finds patterns in data without labels."
]
contexts = [
["Machine learning is a subset of artificial intelligence that focuses on developing algorithms that can learn from data. ML systems can automatically improve their performance through experience."],